What makes Reasoning Models Different? Follow the Reasoning Leader for Efficient Decoding

📅 2025-06-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large reasoning models (LRMs) suffer from high inference latency and “overthinking” due to excessively long chains of thought (CoT). This work identifies, for the first time, a token-level misalignment evolution pattern characterized by a Global Misalignment Rebound and a Local Misalignment Diminish (LMD). Leveraging this phenomenon, we propose FoReaL-Decoding: a novel decoding paradigm integrating sentence-level Leader-Draft coordination, stochastic gating interpolation, and CoT length compression to enable fine-grained, controllable cost–quality trade-offs. Evaluated on four major mathematical reasoning benchmarks, FoReaL-Decoding reduces theoretical FLOPs by 30–50%, shortens CoT length by up to 40%, and retains 86–100% of the original model’s performance. Our approach establishes an interpretable, tunable decoding framework for efficient LRM inference.

📝 Abstract
Large reasoning models (LRMs) achieve strong reasoning performance by emitting long chains of thought. Yet, these verbose traces slow down inference and often drift into unnecessary detail, known as the overthinking phenomenon. To better understand LRMs' behavior, we systematically analyze the token-level misalignment between reasoning and non-reasoning models. While it is expected that their primary difference lies in the stylistic "thinking cues", LRMs uniquely exhibit two pivotal, previously under-explored phenomena: a Global Misalignment Rebound, where their divergence from non-reasoning models persists or even grows as response length increases, and more critically, a Local Misalignment Diminish, where the misalignment concentrates at the "thinking cues" each sentence starts with but rapidly declines in the remainder of the sentence. Motivated by the Local Misalignment Diminish, we propose FoReaL-Decoding, a collaborative fast-slow thinking decoding method for cost-quality trade-off. In FoReaL-Decoding, a Leading model leads the first few tokens for each sentence, and then a weaker draft model completes the following tokens to the end of each sentence. FoReaL-Decoding adopts a stochastic gate to smoothly interpolate between the small and the large model. On four popular math-reasoning benchmarks (AIME24, GPQA-Diamond, MATH500, AMC23), FoReaL-Decoding reduces theoretical FLOPs by 30 to 50% and trims CoT length by up to 40%, while preserving 86 to 100% of model performance. These results establish FoReaL-Decoding as a simple, plug-and-play route to controllable cost-quality trade-offs in reasoning-centric tasks.
Problem

Research questions and friction points this paper is trying to address.

Analyzes token-level misalignment in reasoning models
Proposes FoReaL-Decoding for efficient reasoning model inference
Reduces computational cost while preserving reasoning performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Collaborative fast-slow thinking decoding method
Leading model guides draft model per sentence
Stochastic gate balances cost-quality trade-off
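The Leader-Draft loop described above can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: `foreal_decode`, the token callables, the sentence-end set, and all parameter names are assumptions made for clarity. The idea it mirrors is that the strong leader model emits the first k tokens of each sentence (where misalignment concentrates, per the Local Misalignment Diminish), a cheaper draft model finishes the sentence, and a per-sentence stochastic gate interpolates between leader-heavy and draft-heavy decoding.

```python
import random

# Toy sentence-boundary tokens; real systems would detect boundaries
# in the tokenizer's vocabulary.
SENTENCE_ENDS = {".", "!", "?"}

def foreal_decode(leader, draft, prompt, k=8, gate_p=0.7, max_tokens=512):
    """Sentence-level Leader-Draft decoding sketch.

    leader, draft: callables mapping the token list so far to the next token.
    k: number of sentence-initial tokens the leader emits.
    gate_p: stochastic gate probability; 1.0 recovers leader-led decoding
            for every sentence, 0.0 recovers draft-only decoding.
    """
    tokens = list(prompt)
    while len(tokens) < max_tokens:
        # Stochastic gate: per sentence, decide whether the leader leads.
        n_lead = k if random.random() < gate_p else 0
        for _ in range(n_lead):
            if len(tokens) >= max_tokens:
                break
            tok = leader(tokens)
            tokens.append(tok)
            if tok in SENTENCE_ENDS:
                break  # leader finished the sentence within k tokens
        else:
            # Leader did not close the sentence: the cheaper draft
            # model completes it to the sentence boundary.
            while len(tokens) < max_tokens:
                tok = draft(tokens)
                tokens.append(tok)
                if tok in SENTENCE_ENDS:
                    break
    return tokens
```

With toy generators (a leader that always emits "L" and a draft that always emits "."), `gate_p=1.0` yields leader-led sentences finished by the draft, while `gate_p=0.0` yields draft-only output, showing how the gate trades cost against leader involvement.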