Slim-SC: Thought Pruning for Efficient Scaling with Self-Consistency

📅 2025-09-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational overhead and redundant reasoning chains inherent in Self-Consistency (SC) during test-time scaling, this paper proposes a dynamic pruning method grounded in *thought-level semantic similarity*. Unlike prior approaches that rely on model confidence scores or heuristic rules, the method theoretically and empirically identifies the root causes of SC's inefficiency and, during generation, measures semantic similarity among reasoning chains step by step to adaptively prune redundant ones. The approach is architecture-agnostic, compatible with mainstream large language models, and offers strong interpretability. Experiments across three STEM benchmarks and two prominent LLMs show that, with R1-Distill models, the method reduces inference latency by up to 45% and KV cache memory usage by up to 26% while maintaining or even improving accuracy.

📝 Abstract
Recently, Test-Time Scaling (TTS) has gained increasing attention for improving LLM reasoning performance at test time without retraining the model. A notable TTS technique is Self-Consistency (SC), which generates multiple reasoning chains in parallel and selects the final answer via majority voting. While effective, the order-of-magnitude computational overhead limits its broad deployment. Prior attempts to accelerate SC mainly rely on model-based confidence scores or heuristics with limited empirical support. For the first time, we theoretically and empirically analyze the inefficiencies of SC and reveal actionable opportunities for improvement. Building on these insights, we propose Slim-SC, a step-wise pruning strategy that identifies and removes redundant chains using inter-chain similarity at the thought level. Experiments on three STEM reasoning datasets and two recent LLM architectures show that Slim-SC reduces inference latency and KV cache (KVC) usage by up to 45% and 26%, respectively, with R1-Distill, while maintaining or improving accuracy, thus offering a simple yet efficient TTS alternative for SC.
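The SC baseline the abstract describes, sampling several reasoning chains and taking a majority vote over their final answers, can be sketched in a few lines. This is an illustrative toy (the `self_consistency_vote` helper and the stubbed answers are not from the paper; real SC would sample each chain from an LLM):

```python
from collections import Counter

def self_consistency_vote(answers):
    """Majority vote over the final answers of parallel reasoning chains.

    In real Self-Consistency each answer is extracted from an
    independently sampled chain-of-thought; here the chains are
    stubbed as plain strings for illustration.
    """
    if not answers:
        raise ValueError("need at least one chain")
    counts = Counter(answers)
    # Counter.most_common preserves first-seen order among equal counts,
    # so ties break toward the earliest-sampled answer.
    return counts.most_common(1)[0][0]

# Five hypothetical chains, each reduced to its final answer:
chains = ["42", "41", "42", "42", "17"]
print(self_consistency_vote(chains))  # → 42
```

The cost problem is visible even here: all five chains must be fully generated before the vote, which is exactly the overhead Slim-SC targets.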
Problem

Research questions and friction points this paper is trying to address.

Reducing computational overhead of Self-Consistency scaling
Pruning redundant reasoning chains for efficiency
Maintaining accuracy while decreasing latency and resource usage
Innovation

Methods, ideas, or system contributions that make the work stand out.

Step-wise pruning strategy for redundancy removal
Inter-chain similarity analysis at thought level
Reduces latency and KVC usage significantly
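The pruning idea above can be sketched as follows. This is a loose illustration, not the paper's algorithm: it uses token-set Jaccard overlap as a cheap stand-in for the semantic similarity Slim-SC computes between thoughts, and the threshold value is arbitrary:

```python
def jaccard(a, b):
    """Token-set Jaccard overlap; a crude proxy for thought-level
    semantic similarity (the paper's actual measure differs)."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def prune_redundant(thoughts, threshold=0.8):
    """Greedily keep a chain's latest thought only if it is not too
    similar to any thought already kept; redundant chains would then
    stop generating, saving latency and KV cache."""
    kept = []
    for t in thoughts:
        if all(jaccard(t, k) < threshold for k in kept):
            kept.append(t)
    return kept

thoughts = [
    "factor the quadratic then solve for x",
    "factor the quadratic and then solve for x",  # near-duplicate, pruned
    "try substitution u = x squared",             # distinct, kept
]
print(prune_redundant(thoughts))
```

Applied at each generation step, this kind of check is what lets redundant chains be cut early instead of being generated to completion.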
Colin Hong
NTU Singapore
Xu Guo
KTH Royal Institute of Technology
Anand Chaanan Singh
NTU Singapore
Esha Choukse
Microsoft
Dmitrii Ustiugov
NTU Singapore
Cloud computing · Serverless · Systems for ML