🤖 AI Summary
To address the excessive inference overhead caused by redundant chain-of-thought (CoT) steps during test-time scaling of large language models (LLMs), this paper proposes the Perplexity-based Importance Refinement (PIR) framework. PIR quantitatively models the contribution of each reasoning step to answer confidence, decoupling progressive reasoning (the core problem-solving path) from functional elements such as verification and error correction. This enables fine-grained, step-level importance assessment and selective pruning. PIR requires no additional human annotations, supports lightweight fine-tuning, and generalizes across model scales and data sources. Evaluated on benchmarks including AIME, AMC, and GPQA Diamond, PIR reduces inference tokens by 3%–41% while preserving solution completeness and improving accuracy by 0.9%–6.6%, significantly enhancing both test-time inference efficiency and cross-benchmark generalization.
📝 Abstract
Large language models (LLMs) have demonstrated remarkable reasoning capabilities through test-time scaling approaches, particularly when fine-tuned with chain-of-thought (CoT) data distilled from more powerful large reasoning models (LRMs). However, these reasoning chains often contain verbose elements that mirror human problem-solving, categorized as progressive reasoning (the essential solution development path) and functional elements (verification processes, alternative solution approaches, and error corrections). While progressive reasoning is crucial, the functional elements significantly increase computational demands during test-time inference. We introduce PIR (Perplexity-based Importance Refinement), a principled framework that quantitatively evaluates the importance of each reasoning step based on its impact on answer prediction confidence. PIR systematically identifies and selectively prunes only low-importance functional steps while preserving progressive reasoning components, creating optimized training data that maintains the integrity of the core solution path while reducing verbosity. Models fine-tuned on PIR-optimized data exhibit superior test-time scaling properties, generating more concise reasoning chains while achieving improved accuracy (+0.9% to +6.6%) with significantly reduced token usage (-3% to -41%) across challenging reasoning benchmarks (AIME, AMC, and GPQA Diamond). Our approach demonstrates strong generalizability across different model sizes, data sources, and token budgets, offering a practical solution for deploying reasoning-capable LLMs in scenarios where response latency and computational cost are critical constraints.
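The core idea, scoring each reasoning step by how much its removal raises the perplexity of the final answer and then pruning only low-importance functional steps, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the helper names (`answer_perplexity`, `step_importance`, `prune_low_importance`), the relative-perplexity-rise score, the pruning threshold, and the toy scorer are all assumptions for demonstration; a real pipeline would obtain token log-probabilities from an actual LLM.

```python
import math

def answer_perplexity(answer_tokens, logprob_fn, context_steps):
    """Perplexity of the answer tokens conditioned on a list of reasoning steps."""
    lps = [logprob_fn(tok, context_steps) for tok in answer_tokens]
    return math.exp(-sum(lps) / len(lps))

def step_importance(steps, answer_tokens, logprob_fn):
    """Score each step by the relative rise in answer perplexity when that
    single step is dropped from the chain: higher means more important."""
    base = answer_perplexity(answer_tokens, logprob_fn, steps)
    return [
        (answer_perplexity(answer_tokens, logprob_fn,
                           steps[:i] + steps[i + 1:]) - base) / base
        for i in range(len(steps))
    ]

def prune_low_importance(steps, scores, functional_idx, threshold=0.05):
    """Drop only functional steps whose removal barely raises answer
    perplexity; progressive-reasoning steps are always kept."""
    return [s for i, s in enumerate(steps)
            if i not in functional_idx or scores[i] > threshold]

# Toy stand-in for an LM scorer: answer tokens are likelier when the core
# derivation step is still in context (a real system would query model logprobs).
def toy_logprob(token, context_steps):
    return -0.1 if any("derive" in s for s in context_steps) else -2.0

steps = ["derive x = 3", "double-check the arithmetic", "try an alternative route"]
scores = step_importance(steps, ["x", "=", "3"], toy_logprob)
# Steps 1 and 2 are functional (verification / alternative approach); only
# those are candidates for pruning, mirroring PIR's selective removal.
concise = prune_low_importance(steps, scores, functional_idx={1, 2})
```

In this toy run, removing the derivation step sharply raises answer perplexity (high importance), while removing either functional step leaves it unchanged, so only the functional steps are pruned. The fine-tuning data would then pair the original problem with the shortened chain `concise`.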