🤖 AI Summary
Existing sequential scaling methods rely predominantly on heuristic strategies and lack theoretical guarantees, which limits both performance and interpretability. This work pioneers a formal model of sequential scaling as a two-state Markov process, from which we derive sufficient conditions for accuracy improvement and establish provable upper and lower bounds on performance. Building on this theoretical foundation, we develop a closed-form optimization solution that enables principle-driven inference scheduling. Evaluated across three prominent large language models, five benchmark datasets, and over twenty experimental configurations, our approach consistently outperforms existing parallel and sequential scaling strategies, achieving significant gains in both inference efficiency and accuracy.
📝 Abstract
Sequential scaling is a prominent inference-time scaling paradigm, yet its performance improvements are typically modest and not well understood, largely because prevailing approaches are heuristic and non-principled, obscuring clear optimality bounds. To address this, we propose a principled framework that models sequential scaling as a two-state Markov process. This formulation reveals the underlying properties of sequential scaling and yields closed-form solutions for its essential aspects, such as the specific conditions under which accuracy improves and the theoretical upper, neutral, and lower performance bounds. Leveraging this formulation, we develop MarkovScale, a practical system that applies these optimality criteria to achieve a theoretically grounded balance between accuracy and efficiency. Comprehensive experiments across 3 backbone LLMs, 5 benchmarks, and over 20 configurations show that MarkovScale consistently outperforms state-of-the-art parallel and sequential scaling methods, representing a significant step toward optimal and resource-efficient inference in LLMs. The source code will be released upon acceptance at https://open-upon-acceptance.
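To make the two-state intuition concrete, here is a minimal sketch (not the paper's implementation) of the kind of Markov model the abstract describes. State 1 means the current answer is correct and state 0 means it is incorrect; each sequential revision step fixes a wrong answer with probability `p` and corrupts a correct one with probability `q`. The values `a0`, `p`, and `q` are hypothetical numbers chosen purely for illustration; the paper's actual conditions and bounds may be formulated differently.

```python
def accuracy_after(a0: float, p: float, q: float, t: int) -> float:
    """Accuracy after t revision steps via the recurrence
    a_{t+1} = a_t * (1 - q) + (1 - a_t) * p."""
    a = a0
    for _ in range(t):
        a = a * (1 - q) + (1 - a) * p
    return a

def accuracy_closed_form(a0: float, p: float, q: float, t: int) -> float:
    """Closed form of the same recurrence:
    a_t = a* + (a0 - a*) * (1 - p - q)^t, with limit a* = p / (p + q)."""
    a_star = p / (p + q)
    return a_star + (a0 - a_star) * (1 - p - q) ** t

# Hypothetical demonstration values: initial accuracy 0.4,
# fix probability 0.3, corruption probability 0.1.
a0, p, q = 0.4, 0.3, 0.1
for t in (1, 5, 20):
    print(f"t={t:2d}  recurrence={accuracy_after(a0, p, q, t):.4f}  "
          f"closed-form={accuracy_closed_form(a0, p, q, t):.4f}")
```

In this toy model, accuracy improves with each step exactly when the current accuracy is below the limit `p / (p + q)` (0.75 here), which illustrates the flavor of "sufficient conditions for improvement" and "upper/neutral/lower bounds" the abstract refers to.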