TrimR: Verifier-based Training-Free Thinking Compression for Efficient Test-Time Scaling

📅 2025-05-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address redundancy, overthinking, and underthinking induced by test-time Chain-of-Thought (CoT) scaling in Large Reasoning Models (LRMs), this paper proposes a training-free, verifier-driven dynamic CoT compression framework. The method employs a lightweight pre-trained verifier to identify and prune invalid or redundant reasoning steps in real time, integrating cognitive-inspired design principles with numerical optimization techniques to enable asynchronous, high-throughput industrial deployment. It requires no model fine-tuning and is compatible with Ascend NPUs and the vLLM inference engine. Evaluated on four major benchmarks—including MATH500—the framework achieves up to 70% inference speedup on models such as Pangu-R-38B, with negligible accuracy degradation (<0.3%). This significantly improves the efficiency–accuracy trade-off inherent in CoT scaling, offering a practical, scalable solution for production-grade LRM reasoning.
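The verifier-driven truncation described above can be sketched as a simple streaming loop. This is a minimal illustration, not the paper's algorithm: TrimR uses a lightweight pretrained, instruction-tuned verifier, whereas the `jaccard` overlap score, `threshold`, and `patience` parameters below are stand-in assumptions for demonstration only.

```python
def jaccard(a, b):
    """Lexical overlap between two thought steps (toy stand-in for a real verifier)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def compress_thinking(thoughts, threshold=0.8, patience=2):
    """Keep reasoning steps until `patience` consecutive steps look redundant,
    then truncate the chain of thought early."""
    kept, streak = [], 0
    for step in thoughts:
        # Score the new step against everything kept so far.
        score = max((jaccard(step, k) for k in kept), default=0.0)
        if score >= threshold:
            streak += 1
            if streak >= patience:
                break  # recent steps add nothing new: stop decoding
        else:
            streak = 0
            kept.append(step)
    return kept
```

A repeated step raises the redundancy streak; once the streak reaches `patience`, decoding stops rather than continuing to spend tokens, e.g. `compress_thinking(["a b c", "d e f", "d e f", "d e f"])` keeps only the two distinct steps.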

📝 Abstract
Large Reasoning Models (LRMs) demonstrate exceptional capability in tackling complex mathematical, logical, and coding tasks by leveraging extended Chain-of-Thought (CoT) reasoning. Test-time scaling methods, such as prolonging CoT with explicit token-level exploration, can push LRMs' accuracy boundaries, but they incur significant decoding overhead. A key source of inefficiency is that LRMs often generate redundant thinking CoTs, which exhibit clearly structured overthinking and underthinking patterns. Inspired by human cognitive reasoning processes and numerical optimization theories, we propose TrimR, a verifier-based, training-free, efficient framework for dynamic CoT compression to trim reasoning and enhance test-time scaling, explicitly tailored for production-level deployment. Our method employs a lightweight, pretrained, instruction-tuned verifier to detect and truncate redundant intermediate thoughts of LRMs without any LRM or verifier fine-tuning. We present both the core algorithm and an asynchronous online system engineered for high-throughput industrial applications. Empirical evaluations on Ascend NPUs and vLLM show that our framework delivers substantial gains in inference efficiency under large-batch workloads. In particular, on four benchmarks (MATH500, AIME24, AIME25, and GPQA), the reasoning runtime of Pangu-R-38B, QwQ-32B, and DeepSeek-R1-Distill-Qwen-32B is improved by up to 70% with negligible impact on accuracy.
Problem

Research questions and friction points this paper is trying to address.

Reduces redundant Chain-of-Thought reasoning in Large Reasoning Models
Improves test-time scaling efficiency without fine-tuning models
Enhances inference speed with minimal accuracy impact
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free dynamic CoT compression framework
Lightweight pretrained verifier truncates redundant thoughts
Asynchronous online system for high-throughput deployment
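The asynchronous online system can be hinted at with a small `asyncio` sketch: verifier calls are launched concurrently so they overlap with decoding rather than blocking it. All names here (`generate_steps`, `verify`, `compress`, the "redundant" marker) are hypothetical placeholders, not the paper's actual interfaces.

```python
import asyncio

async def generate_steps(steps):
    """Stand-in for streaming LRM decoding; yields thought segments."""
    for s in steps:
        await asyncio.sleep(0)
        yield s

async def verify(step):
    """Stand-in verifier call; runs off the decoding critical path."""
    await asyncio.sleep(0)
    return "redundant" not in step

async def compress(steps):
    tasks, seen = [], []
    async for step in generate_steps(steps):
        # Launch verification concurrently instead of blocking the next decode step.
        tasks.append(asyncio.create_task(verify(step)))
        seen.append(step)
    verdicts = await asyncio.gather(*tasks)
    return [s for s, ok in zip(seen, verdicts) if ok]
```

Decoupling verification from generation like this is what lets a lightweight verifier serve high-throughput, large-batch workloads without stalling the main inference engine.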