When we compress reasoning models, they find the right answer and then talk themselves out of it.
MATH-500 problem: what is the largest eight-digit base-2 integer, expressed in base 10? Correct answer: 255.
Figure 1. Schematic. Bar widths are proportional to mean MATH-500 chain length on DeepSeek-R1-Distill-Qwen-1.5B (BF16: 5.2K, 3-bit AWQ: 23.4K, +Penalty: 12.9K tokens). The tick position is illustrative; the paper does not measure per-position answer-emission times. See the inline animation in the next section for token-level traces.
TL;DR. Post-training quantization (PTQ) makes reasoning models more compact and faster to deploy. It also produces, in our experiments, much longer chains of thought and lower accuracy on reasoning benchmarks. We perform a token-level analysis to identify one mechanism behind both effects, which we call overthinking: the quantized model reaches the correct answer in an intermediate reasoning step, then opens new branches, second-guesses itself, and never commits. Token-level KL divergence between the full-precision and quantized models concentrates on hesitation tokens (“Wait”, “But”, “Alternatively”), at the decoding positions where the full-precision model was already uncertain. A simple decode-time penalty on a curated set of 50 hesitation tokens shortens chains by 12 to 23% on average across models and benchmarks, while preserving or improving accuracy. We do not claim overthinking is the only effect of PTQ on reasoning models. We claim it is a specific, measurable, and addressable one.
We evaluate five reasoning models (DeepSeek-R1-Distill-Qwen 1.5B, 7B, 14B, DeepSeek-R1-Distill-Llama 8B, and QwQ-32B), three PTQ methods (GPTQ, AWQ, FlatQuant), and five benchmarks (AIME-120, MATH-500, GSM8K, GPQA-Diamond, LiveCodeBench). Mild quantization is essentially free. Once we push to 3-bit weights or W4A4KV4, accuracy drops and chains of thought grow at the same time. The figure below shows the effect for one model, three methods, on MATH-500, and how a simple decode-time penalty restores most of the lost accuracy and shortens chains substantially.


Figure 2. DeepSeek-R1-Distill-Qwen-1.5B on MATH-500. Gray bars: no penalty. Green bars: with the overthinking penalty (best $\lambda$ per configuration). 3-bit AWQ drops accuracy from 85.6% to 47.0% and inflates CoT from 5.2K to 23.4K tokens. The penalty recovers 14.2 points of accuracy and cuts CoT length by 45%. Full per-model, per-benchmark results in Section 4 of the paper.
Below, three decoders run the same MATH-500 problem (excerpts from real traces). All three reach Answer: 255. Only two stop near it.
Excerpts from MATH-500 traces. Hesitation markers in red. The penalty suppresses their logits, so they fail to win the argmax and the model commits.
Across all 28 model and quantization pairs, the Spearman correlation between the change in accuracy and the change in CoT length is ρ = -0.73: the models that lose the most accuracy also generate the longest chains. This is suggestive but not diagnostic on its own. Longer chains could be a symptom of degraded capability, or they could actively cause the degradation. To distinguish these, we look at the errors directly.
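For readers who want to compute this kind of rank correlation on their own runs, here is a minimal sketch of Spearman's ρ in plain Python. The function names are ours and the input values are made-up placeholders, not the measurements behind the ρ = -0.73 figure. Accuracy change is signed, so a loss is negative and a negative ρ means that larger losses go with longer chains.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of entries tied with the entry at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical placeholder data: one entry per (model, quantization) pair.
accuracy_change = [-0.5, -2.0, -10.0, -38.6, -1.0]    # points, negative = loss
cot_length_change = [0.02, 0.10, 1.20, 3.50, 0.05]    # relative increase
rho = spearman(accuracy_change, cot_length_change)
```

In this toy data the ordering is perfectly inverted, so `rho` comes out at -1. Real measurements would give something in between, as in the figure reported above.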
We define an overthinking error as a failure where the model reaches the correct answer at some point in its chain of thought, then opens new reasoning paths, questions its own assumptions, or reverses the correct conclusion before producing a final answer. We annotated a subset of MATH-500 failures by hand and used GPT-5 (calibrated to over 95% agreement with our annotation) to scale labels into four buckets: overthinking, logical error (wrong plan from the start), arithmetic error (right plan, wrong computation), and formatting / hallucination / other.
Figure 3. Error breakdown on MATH-500 for DeepSeek-R1-Distill-Qwen-1.5B. Before the penalty, overthinking errors balloon under quantization: 19 (BF16) → 64 (FlatQuant) → 139 (AWQ 3-bit). After the penalty, overthinking drops to 13 / 34 / 58 respectively. The penalty reduces overthinking errors most dramatically for the most aggressive quantization (a 58% drop for AWQ 3-bit), while leaving logical and arithmetic errors roughly unchanged.
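The relative reductions implied by the counts in the caption can be checked with a few lines of Python. This is only arithmetic on the numbers quoted above; the variable names are ours.

```python
# Overthinking error counts on MATH-500 from Figure 3: (before, after penalty).
counts = {"BF16": (19, 13), "FlatQuant": (64, 34), "AWQ 3-bit": (139, 58)}


def relative_drop(before, after):
    """Fraction of errors removed by the penalty."""
    return (before - after) / before


drops = {name: relative_drop(b, a) for name, (b, a) in counts.items()}
for name, d in drops.items():
    print(f"{name}: {d:.1%}")
# BF16: 31.6%
# FlatQuant: 46.9%
# AWQ 3-bit: 58.3%
```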
The error categories do not grow in proportion. Overthinking errors inflate disproportionately under quantization. Arithmetic and logical errors stay roughly flat in absolute count. The longer chains under PTQ are not just slower correct reasoning. They contain qualitatively different failures, in which the answer was reached and then abandoned.
What we claim and what we do not. Overthinking is one effect of PTQ on reasoning models, isolated through our error analysis. Quantization may degrade reasoning in other ways we do not measure here (changes to retrieval, attention behavior, calibration on harder problems). The contribution of this work is to identify this one mode, characterize its token-level signature, and show that it can be addressed with a targeted decode-time intervention. Other errors are not addressed by our method.
To localize the source of the overthinking errors, we run BF16 and 3-bit AWQ on the same MATH-500 prompts under identical generation prefixes and measure per-position KL divergence $D_{\mathrm{KL}}(p_t \,\|\, q_t)$ between the BF16 and quantized next-token distributions. KL divergence is a standard measure of how much one distribution differs from another. A large KL at position $t$ means the two models disagree sharply about what to say next at that position. We then associate each KL value with the token the quantized model actually sampled, and average across all positions where each unique token appears (with a minimum count of 50).
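The measurement described above can be sketched as follows. This is a minimal illustration, assuming the per-position logits of both models over the same prefix are already available as plain lists. The function names and the input layout are our assumptions, not the paper's code.

```python
import math
from collections import defaultdict


def softmax(logits):
    """Convert a list of logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def mean_kl_per_token(ref_logits, quant_logits, sampled_tokens, min_count=50):
    """Average per-position KL, grouped by the token the quantized model sampled.

    ref_logits[t] and quant_logits[t] are the full-precision and quantized
    logits at position t; sampled_tokens[t] is the token emitted there.
    Tokens seen fewer than `min_count` times are dropped.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for ref, quant, tok in zip(ref_logits, quant_logits, sampled_tokens):
        sums[tok] += kl_divergence(softmax(ref), softmax(quant))
        counts[tok] += 1
    return {tok: sums[tok] / counts[tok] for tok in sums if counts[tok] >= min_count}
```

Sorting the returned dictionary by value would then give a ranking of tokens by how strongly the two models disagree at the positions where each token appears.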
What we mean by a high-entropy position. At a high-entropy position, the full-precision model spreads probability across many roughly comparable next-token candidates rather than concentrating it on one. The model has not made up its mind about what to say next. Low-entropy positions are the opposite: one token (a digit, an operator, the next step of an arithmetic chain) dominates and the rest of the distribution is essentially zero. This distinction matters because, as we show below, quantization disrupts the two kinds of positions very differently.
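As a concrete illustration of the distinction, the Shannon entropy of a next-token distribution can be computed as below. The two distributions are toy examples we made up, and the threshold separating "high" from "low" entropy is left open because the text above does not specify one.

```python
import math


def entropy(probs):
    """Shannon entropy in nats of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Toy distributions over a four-token vocabulary (hypothetical values).
peaked = [0.97, 0.01, 0.01, 0.01]   # one token dominates: low entropy
spread = [0.25, 0.25, 0.25, 0.25]   # comparable candidates: high entropy

h_peaked = entropy(peaked)
h_spread = entropy(spread)  # equals log(4), the maximum for four outcomes
```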
Two things happen at once at high-entropy positions. First, the logit gap between the top token and a hesitation token like “Wait” is small, so a modest perturbation to the logits can flip which token is sampled. Second, hesitation tokens are 2 to 4× more likely to appear among the top-20 candidates at high-entropy positions than at low-entropy ones (see Section 6 of the paper for the per-bin breakdown). Quantization noise has its largest effect exactly at these uncertain positions, so the quantized model is more likely to sample a “Wait” or “But”, open a new reasoning branch, and overwrite the answer it had already found.
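The first point can be made concrete with a toy example: the same small logit perturbation flips the winning token when the gap is small and does nothing when the gap is large. The logits and the perturbation below are hypothetical, and greedy selection is used only to keep the sketch deterministic; it stands in for sampling.

```python
def argmax_token(logits):
    """Return the token with the largest logit."""
    return max(logits, key=logits.get)


def perturb(logits, delta):
    """Add a per-token perturbation (a stand-in for quantization noise)."""
    return {tok: z + delta.get(tok, 0.0) for tok, z in logits.items()}


# Hypothetical logits at two decoding positions.
high_entropy = {"So": 2.0, "Wait": 1.8, "Thus": 1.7}   # small gap to "Wait"
low_entropy = {"5": 9.0, "Wait": 1.8, "6": 1.0}        # one token dominates

noise = {"Wait": 0.3}  # the same modest shift applied at both positions

flipped = argmax_token(perturb(high_entropy, noise))   # "Wait" overtakes "So"
unchanged = argmax_token(perturb(low_entropy, noise))  # "5" still wins
```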
If hesitation tokens at high-entropy positions really drive the overthinking errors, suppressing them at decode time should shorten chains, shrink the overthinking error bucket specifically, and leave the other error categories alone. We curate a set $\mathcal S$ of 50 overthinking markers (“Wait”, “But”, “Alternatively”, “perhaps”, “maybe”, “however”, “reconsider”, “backtrack”, “wrong”, ...; the full list is in the paper) and subtract a fixed penalty $\lambda > 0$ from their logits at every decoding step. Writing $z_t(v)$ for the logit of token $v$ at step $t$, the penalized logit is:

$$\tilde z_t(v) = z_t(v) - \lambda \, \mathbb{1}[v \in \mathcal S]$$
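The penalty described above can be sketched in a few lines. This is a minimal illustration on a plain list of logits, not the implementation used in the experiments. The toy vocabulary, the logit values, and the name `apply_penalty` are assumptions made for the example.

```python
def apply_penalty(logits, marker_ids, lam):
    """Return a copy of `logits` with `lam` subtracted at every id in `marker_ids`.

    logits: list of floats indexed by token id.
    marker_ids: set of token ids belonging to the marker set S.
    lam: penalty strength; lam > 0 suppresses the markers.
    """
    return [z - lam if i in marker_ids else z for i, z in enumerate(logits)]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Toy four-token vocabulary (hypothetical): ids 1 and 3 are hesitation markers.
vocab = ["So", "Wait", "255", "But"]
marker_ids = {1, 3}
logits = [2.0, 2.2, 1.0, 0.5]

before = vocab[argmax(logits)]                                  # "Wait"
after = vocab[argmax(apply_penalty(logits, marker_ids, 1.0))]   # "So"
```

In a real decoder the same subtraction would be applied to the logit vector at every step, before sampling. Non-marker logits are left untouched, so the relative ordering of all other tokens is preserved.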
We use this penalty as a diagnostic for the mechanism we identified, not as a proposed inference method. If the diagnosis is right, suppressing exactly these tokens should produce a specific, predictable change in the overthinking error count. Suppressing other token lists (random tokens, lowest-KL math tokens) should produce nothing or the opposite. Both controls are run below.
We sweep $\lambda \in [0.5, 4.0]$ over four token lists of equal size (50 tokens each): our overthinking markers, the 50 highest-KL tokens, the 50 lowest-KL tokens, and 50 randomly selected tokens. Results are averaged over six quantization configurations and five benchmarks on Qwen-1.5B. We focus on chain-of-thought length and the efficiency–accuracy Pareto frontier. Accuracy is preserved or improved on average across this sweep.
Overthinking markers. Shorten CoT by 12 to 23% at every $\lambda$ we tried. The list occupies the upper-left of the Pareto frontier across all $\lambda$ values. Accuracy is preserved or improved on average.
Random tokens. Negligible effect on CoT length or accuracy. The configuration stays near the no-penalty baseline.
Lowest-KL math tokens. CoT grows by up to 41% and accuracy drops by up to 9.5%. Suppressing the tokens that compute the answer hurts reasoning, as expected.
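To make the Pareto comparison concrete, the sketch below selects the non-dominated configurations from a set of (CoT length, accuracy) points, where shorter chains and higher accuracy are both better. The numbers are invented placeholders chosen to mirror the qualitative pattern described above; they are not results from the sweep.

```python
def pareto_frontier(points):
    """Return the names of configurations not dominated by any other.

    Each point is (name, cot_tokens, accuracy). A point is dominated when
    another point is at least as short and at least as accurate, and
    strictly better on one of the two.
    """
    frontier = []
    for name, length, acc in points:
        dominated = any(
            (l2 <= length and a2 >= acc) and (l2 < length or a2 > acc)
            for _, l2, a2 in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical placeholder numbers, for illustration only.
points = [
    ("no penalty", 10000, 60.0),
    ("markers, lam=1", 8500, 61.0),
    ("markers, lam=2", 8000, 60.5),
    ("random, lam=2", 9900, 59.8),
    ("lowest-KL, lam=2", 13000, 55.0),
]
frontier = pareto_frontier(points)
# frontier == ["markers, lam=1", "markers, lam=2"]
```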
A symmetric sanity check (middle figure above): boosting the same overthinking markers with a negative $\lambda$ inflates CoT by up to 445% and drops accuracy by up to 34%. Random and lowest-KL tokens stay flat under boosting. The lever points in both directions when we expect it to, which closes the loop on the mechanism: the markers we identified are exactly the tokens that control reasoning length, and they control it monotonically in both directions.
The effect generalizes. Across all five models and all PTQ settings, the penalty reduces CoT length by 4.1 to 28.0% on average, with accuracy preserved or improved. The Pareto frontier shifts toward shorter chains in every configuration we tested. The overthinking error bucket on MATH-500 under 3-bit AWQ drops from 139 to 58 without inflating the other error categories, which confirms that the penalty acts on the bucket it was designed to target. The same intervention also shortens chains in full-precision models by 5.6 to 10.0%, so overthinking is a tendency already present in reasoning models that quantization amplifies.
Caveats. The set $\mathcal S$ is curated for English reasoning models. The penalty $\lambda$ is fixed per decoding step rather than adapted to local entropy. Our evaluation is restricted to math, coding, and science benchmarks. Whether the same overthinking signature appears in agentic or open-domain reasoning is an open question.
@article{lotfi2026quantized,
title = {Quantized Reasoning Models Think They Need to Think Longer, but They Do Not},
author = {Lotfi, Sanae and Kirichenko, Polina and Li, Steven and Liu, Zechun},
year = {2026},
}