🤖 AI Summary
To address the high computational overhead, latency, and context redundancy induced by long chain-of-thought (CoT) reasoning in large language models (LLMs), this paper proposes the Parallel-Distill-Refine (PDR) inference framework, which models the LLM as an iterative self-improvement operator over its own reasoning traces. PDR decouples computational cost from the total number of generated tokens via three stages: parallel candidate generation, distillation into a bounded textual workspace, and refinement conditioned on that workspace. The authors further train an 8B-parameter model with reinforcement learning so that training is consistent with PDR as the inference procedure. They also identify Sequential Refinement (SR), the degree-of-parallelism-1 subcase that iteratively improves a single candidate, which already outperforms long CoT; PDR itself delivers simultaneous gains in accuracy and efficiency, shifting the Pareto frontier. On mathematical reasoning benchmarks, PDR achieves +11% and +9% absolute accuracy improvements on AIME 2024 and AIME 2025, respectively, while reducing inference latency and context length.
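The three-stage loop can be made concrete with a minimal sketch. The helper `generate(prompt, n)`, the prompt templates, and the parameter names below are illustrative assumptions, not the paper's implementation:

```python
from typing import Callable

# Minimal sketch of the Parallel-Distill-Refine (PDR) loop. `generate` is a
# hypothetical sampler that returns n completions for a prompt; the prompt
# wording and defaults here are assumptions for illustration only.
def pdr(problem: str,
        generate: Callable[[str, int], list[str]],
        parallelism: int = 4,
        rounds: int = 2) -> str:
    seed = ""  # output of the previous round; conditions the next one
    for _ in range(rounds):
        # (i) Parallel: sample diverse drafts independently.
        draft_prompt = f"{problem}\n\nPrevious attempt (may be empty):\n{seed}"
        drafts = generate(draft_prompt, parallelism)

        # (ii) Distill: compress the drafts into a bounded textual workspace.
        # Its size is governed by the degree of parallelism, not by the total
        # number of tokens generated so far.
        workspace = generate(
            "Summarize the key ideas, partial results, and disagreements in "
            "these candidate solutions:\n\n" + "\n---\n".join(drafts),
            1,
        )[0]

        # (iii) Refine: produce one answer conditioned on the workspace;
        # it seeds the next round.
        seed = generate(
            f"{problem}\n\nNotes on candidate solutions:\n{workspace}\n\n"
            "Write an improved, complete solution.",
            1,
        )[0]
    return seed
```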
📝 Abstract
Reasoning training incentivizes LLMs to produce long chains of thought (long CoT), which, among other things, allows them to explore solution strategies with self-checking. This results in higher accuracy, but inflates context length, token/compute cost, and answer latency. We ask: can current models leverage their metacognition to provide other combinations on this Pareto frontier, e.g., better accuracy with lower context length and/or latency? Abstractly, we view the model as an improvement operator on its own "thoughts", with a continuum of possible strategies. We identify an interesting inference family, Parallel-Distill-Refine (PDR), which performs the following: (i) generate diverse drafts in parallel; (ii) distill them into a bounded, textual workspace; and (iii) refine conditioned on this workspace, producing an output that seeds the next round. Importantly, context length (and hence compute cost) is controllable via the degree of parallelism and is no longer conflated with the total number of generated tokens. We report PDR instantiations of current models that give better accuracy than long CoT while incurring lower latency. Setting the degree of parallelism to 1 yields an interesting subcase, Sequential Refinement (SR), which iteratively improves a single candidate answer and provides performance superior to long CoT. The success of such model orchestrations raises the question of whether further training could shift the Pareto frontier. To this end, we train an 8B thinking model with Reinforcement Learning (RL) to make it consistent with PDR as the inference method. On math tasks with verifiable answers, iterative pipelines surpass single-pass baselines at matched sequential budgets, with PDR delivering the largest gains (e.g., +11% on AIME 2024 and +9% on AIME 2025).
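As a usage note, the Sequential Refinement subcase falls out of the same sketch by setting the degree of parallelism to 1; this snippet again assumes the hypothetical `pdr` and `generate` defined above:

```python
# Sequential Refinement (SR): the parallelism = 1 subcase of PDR, in which
# each round distills and refines a single candidate answer. The round count
# is an arbitrary choice for illustration.
answer = pdr(problem, generate, parallelism=1, rounds=4)
```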