Quantized Reasoning Models Think They Need to Think Longer, but They Do Not

MATH-500 problem: What is the largest eight-digit base 2 integer? Express your answer in base 10. Correct answer: 255.

BF16 (full precision): ~5.2K tokens, commits.
3 bit AWQ, no penalty: ~23.4K tokens, keeps going.
3 bit AWQ + penalty: ~12.9K tokens, commits.

Green tick: first occurrence of the correct answer token sequence. Red marks: overthinking markers sampled after the correct answer was reached.

Figure 1. Quantization amplifies overthinking, and a logit penalty on overthinking markers fixes it without retraining. A quantized reasoning model can reach a correct intermediate answer, doubt itself, open a new reasoning branch, and never produce the final response. We apply a logit penalty to overthinking markers like “but” and “wait”. The penalty shifts the accuracy versus reasoning cost Pareto frontier in every configuration we test, with accuracy preserved or improved on average. Bar widths are proportional to mean MATH-500 CoT length on DeepSeek-R1-Distill-Qwen-1.5B (BF16: 5.2K, 3 bit AWQ: 23.4K, +Penalty: 12.9K tokens).

TL;DR. Across math, coding, and science QA, we find that aggressive post training quantization (PTQ) reduces accuracy and increases chain-of-thought (CoT) length together (Spearman $\rho = -0.73$ across 28 model and quantization pairs). In up to 52% of quantized model failures, the model reaches the correct answer mid reasoning and then talks itself out of it. A token level KL divergence analysis shows that quantization most affects overthinking markers (“wait”, “but”, “alternatively”) at high entropy positions, while mathematical tokens are nearly unchanged. We use this to build a training free logit penalty on 50 curated overthinking markers that shortens CoT length by 12 to 23% across 5 models (1.5B to 32B), 3 quantization methods, and 5 benchmarks, with a better accuracy versus reasoning cost Pareto frontier. The same penalty also reduces BF16 CoT length by 5.6 to 10.0%: quantization amplifies a failure mode already present in full precision reasoning models rather than introducing a new one.

Accuracy drops while CoT length increases under PTQ

We evaluate five reasoning specialized models spanning 1.5B to 32B parameters (DeepSeek-R1-Distill-Qwen 1.5B, 7B, 14B, DeepSeek-R1-Distill-Llama 8B, and QwQ-32B), three PTQ methods (GPTQ, AWQ, FlatQuant), and five benchmarks across mathematics, coding, and science (AIME-120, MATH-500, GSM8K, GPQA-Diamond, LiveCodeBench). Mild quantization (FlatQuant W8A8KV8, 4 bit weight only AWQ and GPTQ) is largely benign: accuracy and CoT length both stay close to the BF16 baseline. As precision drops further, accuracy and reasoning efficiency degrade together. The figure below shows this for DeepSeek-R1-Distill-Qwen-1.5B on MATH-500, and how the overthinking penalty shifts the accuracy versus reasoning cost tradeoff.

MATH-500 accuracy: BF16, FlatQuant W4A4, AWQ 3 bit, before and after the overthinking penalty

MATH-500 CoT length: BF16, FlatQuant W4A4, AWQ 3 bit, before and after the overthinking penalty

Figure 2. Quantized reasoning models produce longer CoT while achieving lower accuracy. DeepSeek-R1-Distill-Qwen-1.5B on MATH-500. Gray bars: no penalty. Green bars: with the overthinking penalty (best $\lambda$ per configuration). 3 bit AWQ reduces accuracy from 85.6% to 47.0% while increasing the average CoT from 5.2K to 23.4K tokens, a 4.5× increase. The penalty improves accuracy by 14.2 points and reduces CoT length by 45%. Full model and benchmark results are in Section 4 of the paper.
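The ratios in the Figure 2 caption follow directly from the quoted means. The short check below recomputes them; the penalized accuracy of 61.2% is not quoted in the text and is only implied by adding the 14.2 point gain to 47.0%.

```python
# Figures quoted in the Figure 2 caption (DeepSeek-R1-Distill-Qwen-1.5B, MATH-500).
bf16_len, awq3_len, awq3_pen_len = 5.2e3, 23.4e3, 12.9e3  # mean CoT tokens
awq3_acc, gain_points = 47.0, 14.2                        # accuracy in %, gain in points

length_blowup = awq3_len / bf16_len          # CoT growth under 3 bit AWQ
length_cut = 1.0 - awq3_pen_len / awq3_len   # relative CoT reduction from the penalty
penalized_acc = awq3_acc + gain_points       # implied accuracy after the penalty, in %

print(f"{length_blowup:.1f}x longer, {length_cut:.0%} shorter, {penalized_acc:.1f}% accuracy")
```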

A qualitative example from MATH-500

The three generations below illustrate the overthinking failure mode. The BF16 model follows a concise solution. The 3 bit quantized model reaches the correct intermediate result, then samples repeated overthinking markers such as “wait” and “but”, which open alternative reasoning paths and delay termination. The penalized model reduces the probability of those tokens and terminates after reaching the answer. All three reach Answer: 255; only two commit to it.

BF16 (full precision)

~5.2K tokens, terminates

Let x = largest 8-digit base-2 integer
= 2^8 - 1
= 256 - 1
Answer: 255
<done>

state: DONE, tokens: 5,156

3 bit AWQ

~23.4K tokens, reaches context limit

Let x = 2^8 - 1 = 256 - 1
Answer: 255.
Wait, maybe 11111111 is 8 bits, so 2^8 means 256... no, it's 255.
But what if the phrasing means 2^9-1?
Alternatively, 0s and 1s could be...
Hmm, let me reconsider. 11111111 binary is 128+64+32+16+8+4+2+1 = 255. So actually...

state: STILL GOING, tokens: 23,444

3 bit AWQ + penalty

~12.9K tokens, terminates

Let x = 2^8 - 1
= 256 - 1
Answer: 255.
Wait, maybe ✂
But what if ✂
Alternatively ✂
<done>

state: DONE, tokens: 12,873

Excerpts from MATH-500 traces. Overthinking markers are shown in red. The penalty subtracts a fixed value from the logits of these tokens at every decoding step, reducing the probability that the model opens a new reasoning branch after it has already reached the answer.

Across all 28 model and quantization pairs, the Spearman correlation between accuracy and CoT length is ρ = -0.73: the models that lose the most accuracy are the same ones that produce the longest reasoning traces. PTQ reduces accuracy and increases CoT length at the same time, which suggests that the extra reasoning is part of how quantization hurts accuracy and not just a side effect of it.
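For readers unfamiliar with the statistic, a Spearman correlation is a Pearson correlation computed on ranks. The sketch below uses only the standard library; the five data points are hypothetical toy values chosen for illustration and are not the 28 pairs from the paper.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy values (accuracy in %, mean CoT length in K tokens).
accuracy = [85.0, 80.0, 70.0, 72.0, 50.0]
cot_len = [5.0, 6.0, 9.0, 12.0, 23.0]
print(round(spearman(accuracy, cot_len), 2))  # negative: lower accuracy, longer CoT
```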

Quantization increases overthinking errors

To see which errors quantization amplifies, we categorize incorrect answers from DeepSeek-R1-Distill-Qwen-1.5B on MATH-500. We first annotate a subset of errors manually, then use GPT-5 as a judge to scale the categorization, tuning the prompt to reach over 95% agreement with human annotation. We assign each failure to exactly one of:
(i) Overthinking: the model reaches the correct solution at some point in its chain-of-thought but does not commit to it as the final answer. Instead, the model opens new reasoning paths, excessively questions its own assumptions, or reverses a correct conclusion.
(ii) Logical error: the model follows an incorrect plan or misunderstands the problem from early on, such that the reasoning trajectory is wrong before any correct solution is reached.
(iii) Arithmetic error: the overall approach and plan are correct, but the model makes concrete computational mistakes that lead to a wrong final answer.
(iv) Formatting, hallucination, and other errors: off format answers, hallucinated constraints, or other failures not covered above.
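The overthinking category has a simple surface signature: the correct answer appears somewhere in the trace, but the final answer differs. The sketch below is a crude string-match pre-filter built on that signature. It is an illustrative assumption, not the GPT-5 judge or the manual annotation described above, and the function name and example trace are hypothetical.

```python
def is_overthinking_candidate(cot: str, final_answer: str, correct: str) -> bool:
    """Flag a failure whose trace contains the correct answer but whose
    final answer differs. Substring matching is deliberately naive: it
    would also match "2550" when looking for "255"."""
    correct = correct.strip()
    reached = correct in cot
    committed = final_answer.strip() == correct
    return reached and not committed


# Hypothetical trace in the spirit of the MATH-500 example above.
trace = "2^8 - 1 = 255. Wait, what if the phrasing means 2^9 - 1 = 511?"
print(is_overthinking_candidate(trace, final_answer="511", correct="255"))
```

A filter like this can only nominate candidates; deciding whether the model truly reached and then abandoned the solution still needs a judge.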

MATH-500 error breakdown across BF16, FlatQuant W4A4, and AWQ 3 bit, before and after the overthinking penalty

Figure 3. Quantization increases overthinking errors, and penalizing overthinking markers reduces them. Error breakdown on MATH-500 for DeepSeek-R1-Distill-Qwen-1.5B. Before the penalty, overthinking errors increase sharply under quantization: 19 (BF16) → 64 (FlatQuant W4A4KV4) → 139 (AWQ 3 bit). Under 3 bit AWQ, overthinking becomes the dominant failure mode at 52% of all 265 errors, a 7.3× increase in absolute count over BF16. Penalizing overthinking markers and high KL tokens consistently reduces overthinking errors, by up to 58% on AWQ 3 bit, while improving accuracy by up to 14.2 points. Penalizing random or low KL tokens does not reduce overthinking errors and can even increase them.

The longer reasoning traces produced by quantized models are not slower correct reasoning. They contain excessive hesitation and self-doubt that actively turn correct intermediate answers into incorrect final answers, while the other error categories grow much less. Quantization can also hurt reasoning in other ways tied to capability loss, but we focus on overthinking because, like PTQ itself, we can address it entirely at inference time, with no training and no computational overhead.

Quantized models diverge from full precision ones at branching positions

To understand how quantization increases overthinking errors, we measure token level KL divergence between the output distributions of the BF16 and 3 bit AWQ models. We treat the full precision model as a reference and run both models on the same MATH-500 prompts under identical generation prefixes to isolate the effects of quantization. At each decoding position $t$, we compute the KL divergence $D_{\mathrm{KL}}(p_t \,\|\, q_t) = \sum_{v \in \mathcal V} p_t(v) \log \frac{p_t(v)}{q_t(v)}$, where $p_t$ and $q_t$ are the next token distributions of the full precision and quantized models. A large KL at position $t$ means the two models disagree sharply about what to say next. We then associate each KL value with the token the quantized model sampled at that position and compute, for each unique token in the vocabulary, its average KL across all positions where it was sampled, filtering to tokens appearing at least 50 times.
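The per-position KL and the per-token averaging described above can be sketched in a few lines. This is a minimal standard-library version operating on plain lists of logits; the function names and the `eps` floor are assumptions for illustration, and a real pipeline would run on batched tensors from the two models.

```python
import math
from collections import defaultdict


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two distributions over the same vocabulary.
    Terms with p(v) = 0 contribute nothing; q is floored at eps."""
    return sum(pv * math.log(pv / max(qv, eps)) for pv, qv in zip(p, q) if pv > 0.0)


def mean_kl_per_token(ref_logits, quant_logits, sampled_tokens, min_count=50):
    """Average position-level KL, grouped by the token the quantized model
    sampled at each position, keeping tokens seen at least min_count times."""
    sums, counts = defaultdict(float), defaultdict(int)
    for z_ref, z_q, tok in zip(ref_logits, quant_logits, sampled_tokens):
        sums[tok] += kl_divergence(softmax(z_ref), softmax(z_q))
        counts[tok] += 1
    return {t: sums[t] / counts[t] for t in sums if counts[t] >= min_count}
```

Here `ref_logits[t]` and `quant_logits[t]` stand for the BF16 and quantized logits at position $t$ under the same prefix, matching $p_t$ and $q_t$ in the formula.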

What we mean by a high entropy position. At a high entropy position, the full precision model spreads probability across many roughly comparable next token candidates rather than concentrating it on one. Low entropy positions are the opposite: one token, such as a digit, operator, or the next step of an arithmetic chain, dominates the distribution. This distinction matters because quantization most affects positions where the reference model is already locally uncertain, with a small logit margin.
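To make the distinction concrete, the sketch below computes Shannon entropy (in nats) for two hypothetical four-token distributions: one peaked, as at an arithmetic step, and one flat, as at a branching point. The probabilities are made up for illustration.

```python
import math


def entropy(p):
    """Shannon entropy in nats of a probability distribution."""
    return -sum(pv * math.log(pv) for pv in p if pv > 0.0)


# Hypothetical next-token distributions over a 4-token vocabulary.
peaked = [0.97, 0.01, 0.01, 0.01]  # low entropy: one token dominates
flat = [0.25, 0.25, 0.25, 0.25]    # high entropy: candidates are comparable

print(f"peaked: {entropy(peaked):.3f} nats, flat: {entropy(flat):.3f} nats")
```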

Top 20 tokens with the highest KL divergence between BF16 and 3 bit AWQ. The highest KL tokens are overthinking markers.

Top 20 tokens with the lowest KL divergence, all with KL well under 0.02. The lowest KL tokens are mathematical and formatting tokens. Note the x axis only reaches 0.06.

Position level KL divergence vs. BF16 next token entropy. Spearman ρ = 0.92.

Quantization disproportionately affects overthinking markers at high entropy positions

The tokens where the quantized and full precision models diverge the most are overthinking markers such as “Wait”, “But”, “Alternatively”, “if”, “maybe”, “verify”, and “think”. Many of these open a new reasoning path and signal the model's hesitation.
The tokens where the two models agree most closely are mathematical and formatting tokens. Quantization barely changes how the model samples these execution tokens. Note the x axis range: low KL values are nearly 100× smaller than high KL values (the largest low KL value is around 0.015, while the largest high KL value is around 1.30).
Position level KL divergence correlates strongly with the BF16 model's next token entropy (Spearman ρ = 0.92), confirming that quantization most affects positions where the model is already uncertain.

Two things happen at once at high entropy positions. First, the logit gap between the top token and an overthinking marker like “Wait” is small, so even a moderate perturbation to the logits can flip which token is sampled. Second, overthinking markers are 2 to 4× more likely to appear among the top 20 most probable tokens at high entropy positions than at low entropy ones. Because these positions are inherently uncertain, quantization noise has a larger effect on the output distribution, and we are more likely to see the quantized model sample a “Wait” or “But”, open a new reasoning branch, and overwrite the correct intermediate answer it had already found.
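The first point can be illustrated with a toy example: the same small logit perturbation leaves a confident position untouched but changes the top token at an uncertain one. The sketch uses greedy (argmax) selection as a simplification of sampling, and the logits and the 0.5 perturbation are hypothetical.

```python
def argmax(xs):
    """Index of the largest element."""
    return max(range(len(xs)), key=lambda i: xs[i])


def flips_to_marker(logits, marker_idx, bump):
    """True if raising the marker's logit by `bump` makes the marker the
    most likely token when it was not before."""
    perturbed = list(logits)
    perturbed[marker_idx] += bump
    return argmax(logits) != marker_idx and argmax(perturbed) == marker_idx


# Hypothetical logits; index 1 plays the role of "Wait".
low_entropy = [9.0, 3.0, 2.0]   # large margin between top token and marker
high_entropy = [3.2, 3.0, 2.9]  # small margin: candidates are comparable
bump = 0.5                      # stand-in for quantization noise on one logit

print(flips_to_marker(low_entropy, 1, bump), flips_to_marker(high_entropy, 1, bump))
```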

We also compute the density of high KL and low KL tokens relative to total CoT length. In BF16, low KL math tokens appear at a higher rate than high KL overthinking markers (high KL to low KL ratio of 0.57). Under 3 bit AWQ, this ratio flips to 1.15: high KL tokens outnumber low KL tokens, so the quantized model produces more overthinking markers per unit of reasoning than mathematical content. This shift in token composition is how quantization leads to less efficient reasoning.
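A ratio of this kind can be computed by counting tokens from each list in a trace; dividing both densities by the same CoT length cancels out, so only the two counts matter. The token sets and the toy trace below are hypothetical and far smaller than the real lists.

```python
def high_to_low_ratio(tokens, high_kl_set, low_kl_set):
    """Ratio of high KL token count to low KL token count in one trace."""
    high = sum(1 for t in tokens if t in high_kl_set)
    low = sum(1 for t in tokens if t in low_kl_set)
    return high / low if low else float("inf")


# Hypothetical miniature token lists and trace.
high_kl = {"Wait", "But", "Alternatively"}
low_kl = {"2", "+", "=", "4"}
trace_tokens = ["2", "+", "2", "=", "4", "Wait", "But"]

print(high_to_low_ratio(trace_tokens, high_kl, low_kl))
```

A value below 1 means mathematical tokens dominate, as reported for BF16; a value above 1 means markers dominate, as reported under 3 bit AWQ.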

Penalizing overthinking markers

The analysis above shows that quantization disproportionately affects overthinking markers, while leaving mathematical tokens largely unchanged. We now use this to build an inference time intervention that reduces the probability of these markers during decoding. We curate a set $\mathcal S$ of 50 overthinking markers: tokens that signal hesitation, self-doubt, or a switch to a new reasoning branch (“Wait”, “But”, “Alternatively”, “perhaps”, “maybe”, “however”, “reconsider”, “backtrack”, “wrong”, …; the full list is in the paper). At each decoding step $t$, we modify the logits $z_t(v)$ for each token $v \in \mathcal S$ by subtracting a fixed penalty $\lambda > 0$:

$$ z'_t(v) = z_t(v) - \lambda \;\;\text{if}\;\; v \in \mathcal S, \quad z'_t(v) = z_t(v) \;\;\text{otherwise.} $$
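The rule above amounts to one subtraction per marker token at each decoding step. Below is a minimal sketch on plain Python lists; the function name, the three-token vocabulary, and the logit values are assumptions for illustration. In a real decoding stack the same operation would sit in whatever per-step logit hook the inference framework provides.

```python
import math


def apply_marker_penalty(logits, marker_ids, lam):
    """Return a copy of `logits` with `lam` subtracted at every index in
    `marker_ids`; all other logits are left unchanged."""
    return [z - lam if v in marker_ids else z for v, z in enumerate(logits)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 3-token vocabulary; index 1 stands for a marker such as "Wait".
logits = [2.0, 1.8, 0.5]
marker_ids = {1}

before = softmax(logits)
after = softmax(apply_marker_penalty(logits, marker_ids, lam=2.0))
print(f"marker probability: {before[1]:.3f} -> {after[1]:.3f}")
```

The penalty does not forbid the marker. It only lowers its probability, so the model can still branch when the marker's logit leads by more than $\lambda$.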

The penalty adds zero computational overhead and has a single hyperparameter $\lambda$. To check that the effect is specific to overthinking markers and not a generic artifact of touching the logits, we run controlled ablations with three other token lists of the same size (50 tokens each): the 50 highest KL tokens between BF16 and 3 bit AWQ, the 50 lowest KL tokens, and 50 randomly selected tokens. We apply the logit penalty to each list on Qwen-1.5B across six quantization configurations and five benchmarks, sweeping $\lambda$ from 0.5 to 4.0.
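It may help to see what a given $\lambda$ does to the probability of sampling a marker. The short derivation below follows from the softmax alone; the symbol $P_{\mathcal S}$ is introduced here for convenience and denotes the total pre-penalty probability of the marker set at step $t$, $P_{\mathcal S} = \sum_{v \in \mathcal S} p_t(v)$. After the penalty, each marker's unnormalized weight is scaled by $e^{-\lambda}$, so

$$ P'_{\mathcal S} = \frac{e^{-\lambda} P_{\mathcal S}}{1 - P_{\mathcal S}\,(1 - e^{-\lambda})}, \qquad \frac{P'_{\mathcal S}}{1 - P'_{\mathcal S}} = e^{-\lambda}\, \frac{P_{\mathcal S}}{1 - P_{\mathcal S}}. $$

In other words, the odds of opening with a marker shrink by a factor of $e^{-\lambda}$: roughly $0.61$ at $\lambda = 0.5$ and roughly $0.018$ at $\lambda = 4.0$, the two ends of the sweep.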

CoT length vs. positive penalty strength for four token lists

Positive penalty: shortens CoT.

CoT length vs. negative penalty strength (boosting overthinking markers)

Negative penalty: increases CoT (directionality check).

Accuracy vs. length Pareto frontier across all penalty strengths

Pareto frontier: overthinking markers occupy the upper left.

Overthinking markers. Shorten CoT by 12 to 23% at every $\lambda$, with accuracy preserved or slightly improved. The list occupies the upper left region of the Pareto frontier for all $\lambda$ values, the best accuracy versus reasoning cost tradeoff among the four lists.

Random tokens. Near zero effect on both accuracy and CoT length: the configuration stays on top of the no-penalty baseline. Suppressing arbitrary tokens does nothing on its own.

Lowest KL math tokens. Catastrophic in the opposite direction: CoT grows by up to 41% and accuracy drops by up to 9.5%. These are the tokens that carry the computational content of reasoning, and they are not the ones responsible for the post-quantization degradation.

A symmetric directionality check (middle figure above): boosting the same overthinking markers with a negative $\lambda$ increases CoT by up to 445% and drops accuracy by up to 34%. High KL tokens behave similarly; random and lowest KL tokens stay nearly neutral under boosting. Overthinking markers and high KL tokens control reasoning length in both directions.

The effect generalizes. Across all 5 models and all PTQ settings, the penalty shortens CoT length by 12 to 23% on average and shifts the Pareto frontier toward shorter CoT in every configuration we tested. On MATH-500 under 3 bit AWQ, overthinking errors drop from 139 to 58 (a 58% reduction) without increasing the other error categories: the accuracy gains we observe in some configurations come from this drop in overthinking errors specifically, not from a generic boost in capability. In a small number of configurations accuracy moves slightly in either direction, but CoT length always shrinks, so the Pareto frontier moves toward lower reasoning cost at every $\lambda$.
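A Pareto comparison of this kind can be sketched as a dominance check over (CoT length, accuracy) points: one point dominates another if it is no longer and no less accurate, and strictly better on at least one axis. The example uses the 3 bit AWQ numbers from the text for DeepSeek-R1-Distill-Qwen-1.5B on MATH-500; the penalized accuracy of 61.2% is implied by 47.0% plus 14.2 points and is not quoted directly.

```python
def pareto_front(points):
    """Return the names of points not dominated by any other point.
    Each point is (name, cot_length, accuracy); shorter and more accurate is better."""
    front = []
    for name, length, acc in points:
        dominated = any(
            l2 <= length and a2 >= acc and (l2 < length or a2 > acc)
            for _, l2, a2 in points
        )
        if not dominated:
            front.append(name)
    return front


# (name, mean CoT length in K tokens, accuracy in %) under 3 bit AWQ.
awq3 = [("no penalty", 23.4, 47.0), ("penalty", 12.9, 61.2)]
print(pareto_front(awq3))
```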

The same penalty also reduces CoT length in full precision models by 5.6 to 10.0% depending on the model. Overthinking is a tendency already present in BF16 reasoning models; quantization amplifies it, and the same lightweight intervention helps in both regimes.

Contributions

We show that PTQ exacerbates overthinking in reasoning models. In up to 52% of failures under aggressive quantization, the model reaches the correct answer in intermediate steps but fails to commit to it. Overthinking errors increase up to 7.3× in absolute count on MATH-500 under 3 bit AWQ, while other error categories grow much less. Quantization does not introduce overthinking as a new failure mode but rather amplifies a tendency already present in BF16 reasoning models: the same penalty reduces BF16 CoT length by 5.6 to 10.0%.
We identify which tokens drive this failure through a KL divergence analysis between the quantized and full precision output distributions. Among the tokens with the highest divergence are overthinking markers that concentrate at high entropy positions. The tokens with the lowest divergence are mathematical tokens that carry the computational content of reasoning. Position level KL correlates with BF16 next token entropy at Spearman $\rho = 0.92$.
We introduce a very simple yet effective logit penalty on a curated set of 50 overthinking markers. The penalty is training free, adds no computational overhead, and consistently reduces CoT length by 12 to 23% on average across 5 models (1.5B to 32B) $\times$ 3 quantization methods $\times$ 5 benchmarks. We see accuracy preserved or improved, with the gains coming from up to 58% fewer overthinking errors on AWQ 3 bit. Controlled ablations confirm that penalizing overthinking markers yields the most favorable Pareto frontier of accuracy against reasoning cost: penalizing random tokens does nothing, and penalizing the lowest KL math tokens is catastrophic (CoT $+41\%$, accuracy $-9.5\%$).

Limitations and future work. Our penalty uses a fixed $\lambda$ at every decoding step; adapting $\lambda$ to local entropy is a natural extension. The overthinking markers list $\mathcal S$ is curated for English reasoning models, and we expect automatic per-model construction to make the approach more portable. Our evaluation covers math, coding, and science; whether the same signature appears in agentic or open domain reasoning is an open question. More broadly, our KL divergence analysis is not specific to quantization: any perturbation to a model's weights or activations (pruning, distillation, low rank approximation) may behave similarly at uncertain positions, so the same diagnostic could apply more broadly.