🤖 AI Summary
This work identifies a prevalent inefficiency in large language models (LLMs) during inference: frequent self-verification steps that rarely detect errors, leading to substantial redundant computation. To address this, the authors propose a dynamic suppression mechanism grounded in an offline experience pool. By retrieving historical verification outcomes, the method assesses whether a current verification step is necessary and selectively skips redundant checks at test time. Evaluated across multiple models and benchmarks, this approach reduces token consumption by up to 20.3% while maintaining or even slightly improving reasoning accuracy, thereby significantly enhancing inference efficiency.
📝 Abstract
Large Reasoning Models (LRMs) achieve strong performance by generating long reasoning traces with reflection. Through a large-scale empirical analysis, we find that a substantial fraction of reflective steps consist of self-verification (recheck) steps that repeatedly confirm intermediate results. These rechecks occur frequently across models and benchmarks, yet the vast majority are confirmatory rather than corrective, rarely identifying errors or altering reasoning outcomes. This reveals a mismatch between how often self-verification is activated and how often it is actually useful. Motivated by this, we propose a novel, experience-driven test-time framework that reduces overused verification. Our method detects the activation of recheck behavior, consults an offline experience pool of past verification outcomes, and estimates via efficient retrieval whether a recheck is likely unnecessary. When historical experience suggests the recheck is unnecessary, a suppression signal redirects the model to proceed with its reasoning. Across multiple models and benchmarks, our approach reduces token usage by up to 20.3% while maintaining accuracy, and on some datasets it even improves accuracy.
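The pipeline in the abstract (detect recheck activation, retrieve similar past verification outcomes from an offline pool, suppress when history says the check is unlikely to be corrective) can be sketched as below. This is a minimal illustration, not the paper's implementation: the cue phrases, the toy hash-based embedding, the `ExperiencePool` class, and the `k`/`threshold` parameters are all assumptions made for the example.

```python
# Hypothetical sketch of experience-driven recheck suppression.
# All names (RECHECK_CUES, ExperiencePool, should_suppress) are illustrative,
# not the authors' actual code; a real system would use learned embeddings
# and an ANN index over a large pool of logged verification outcomes.
import math

# Surface cues that a generated step is starting a self-verification (assumed).
RECHECK_CUES = ("let me verify", "double-check", "let me recheck")


def detect_recheck(step_text: str) -> bool:
    """Detect whether a generated reasoning step activates self-verification."""
    lowered = step_text.lower()
    return any(cue in lowered for cue in RECHECK_CUES)


def embed(text: str, dim: int = 8) -> list[float]:
    """Toy stand-in for a sentence embedding: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dim
    for tok in text.lower().split():
        vec[hash(tok) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


class ExperiencePool:
    """Offline pool of (context, recheck_was_corrective) records.

    Each record stores the embedded reasoning context and a 0/1 flag for
    whether the recheck in that context actually changed the outcome.
    """

    def __init__(self, records: list[tuple[str, int]]):
        self.records = [(embed(ctx), corrective) for ctx, corrective in records]

    def should_suppress(self, context: str, k: int = 3,
                        threshold: float = 0.2) -> bool:
        """Suppress the recheck if, among the k most similar past contexts,
        the fraction of corrective (useful) rechecks falls below threshold."""
        query = embed(context)
        nearest = sorted(self.records,
                         key=lambda rec: -cosine(query, rec[0]))[:k]
        useful_rate = sum(flag for _, flag in nearest) / max(len(nearest), 1)
        return useful_rate < threshold
```

At test time, when `detect_recheck` fires, the generator would consult `should_suppress`; a `True` result corresponds to emitting the suppression signal that redirects the model to continue reasoning instead of re-verifying.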