Seer Self-Consistency: Advance Budget Estimation for Adaptive Test-Time Scaling

📅 2025-11-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost and latency inherent in test-time scaling for large language models (LLMs), this paper proposes a dual-system adaptive inference framework. It employs a fast “System 1” to estimate answer entropy—serving as a proxy for sample-wise expansion potential—and dynamically triggers parallelized “System 2” self-consistency generation only when beneficial. The framework integrates dynamic self-consistency with an answer-entropy–driven early-stopping mechanism, jointly optimizing inference quality and efficiency. Empirically, the method achieves up to 47% reduction in token consumption and 43% lower inference latency—without compromising accuracy—outperforming existing sequential self-consistency approaches. Notably, it is the first work to systematically incorporate cognitive dual-system theory into test-time scaling design.
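The summary's core signal is answer entropy: run a few cheap "System 1" samples, measure how much their final answers disagree, and use that disagreement as a proxy for whether expensive scaling would help. A minimal sketch of that entropy score (not the paper's implementation; the function name is illustrative):

```python
from collections import Counter
import math

def answer_entropy(answers):
    """Shannon entropy (in bits) over the distribution of final answers.

    Low entropy means the cheap samples already agree, so further
    self-consistency sampling is unlikely to change the majority vote;
    high entropy marks a query with expansion potential.
    """
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

For example, four identical answers give entropy 0 (skip scaling), while an even two-way split gives 1 bit (worth spending budget on).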

📝 Abstract
Test-time scaling improves the inference performance of Large Language Models (LLMs) but also incurs substantial computational costs. Although recent studies have reduced token consumption through dynamic self-consistency, they remain constrained by the high latency of sequential requests. In this paper, we propose SeerSC, a dynamic self-consistency framework that simultaneously improves token efficiency and latency by integrating System 1 and System 2 reasoning. Specifically, we use the rapid System 1 to compute the answer entropy for a given query. This score is then used to evaluate the sample's potential for scaling, enabling dynamic self-consistency under System 2. Benefiting from the early and accurate estimation provided by System 1, the proposed method reduces token usage while also achieving a significant decrease in latency through parallel generation. It outperforms existing methods, achieving up to a 47% reduction in token consumption and a 43% reduction in inference latency without significant performance loss.
Problem

Research questions and friction points this paper is trying to address.

Reducing computational costs in test-time scaling for LLMs
Overcoming high latency in sequential self-consistency requests
Improving token efficiency while maintaining model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates System 1 and System 2 reasoning
Uses System 1 to compute answer entropy
Enables parallel generation to reduce latency
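The three innovations above compose into one loop: probe with System 1, gate on answer entropy, and only then fire the System 2 budget in parallel. A hedged sketch of that control flow, assuming hypothetical `fast_answer`/`slow_answer` callables and illustrative defaults for the probe size, budget, and entropy gate:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import math

def answer_entropy(answers):
    # Shannon entropy (bits) over the final-answer distribution.
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def adaptive_self_consistency(query, fast_answer, slow_answer,
                              probe_n=4, budget_n=16, entropy_gate=0.5):
    """Entropy-gated self-consistency: cheap probe first, parallel scaling
    only when the probe disagrees with itself. Thresholds are illustrative."""
    # System 1: a handful of cheap samples to estimate expansion potential.
    probe = [fast_answer(query) for _ in range(probe_n)]
    if answer_entropy(probe) <= entropy_gate:
        # Confident query: majority-vote the probe, spend no System 2 budget.
        return Counter(probe).most_common(1)[0][0]
    # System 2: issue the full sampling budget in parallel (one request per
    # worker) rather than sequentially, then majority-vote the answers.
    with ThreadPoolExecutor(max_workers=budget_n) as pool:
        samples = list(pool.map(slow_answer, [query] * budget_n))
    return Counter(samples).most_common(1)[0][0]
```

The parallel `ThreadPoolExecutor` call is what distinguishes this from sequential dynamic self-consistency: latency is bounded by one slow generation rather than many, while the entropy gate keeps easy queries from consuming the budget at all.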
Shiyu Ji
University of California, Santa Barbara
Information Retrieval · Privacy · Security
Yixuan Wang
Research Center for Social Computing and Interactive Robotics, Harbin Institute of Technology, China
Yijun Liu
Research Center for Social Computing and Interactive Robotics, Harbin Institute of Technology, China
Qingfu Zhu
Harbin Institute of Technology
NLP · Code LLM
Wanxiang Che
Professor, Harbin Institute of Technology
Natural Language Processing