🤖 AI Summary
Existing large language models (LLMs) allocate a fixed computational budget at inference time, ignoring differences in query difficulty, which leads to inefficient resource use and suboptimal performance. This work is the first to formulate test-time compute scheduling as a **difficulty-aware multi-armed bandit (MAB) problem**, enabling online estimation of query solvability and dynamic budget allocation. The method couples real-time difficulty estimation with a bandit-driven scheduler, yielding adaptive, low-overhead compute allocation tailored to each query's complexity. On mathematical reasoning (MATH-500) and code generation (LiveCodeBench) benchmarks, the approach improves absolute accuracy by 11.10% (+15.04% relative) and 7.41% (+14.40% relative), respectively, while significantly improving compute efficiency, demonstrating better accuracy per unit of compute than fixed-budget baselines.
📝 Abstract
Scaling test-time compute has emerged as an effective strategy for improving the performance of large language models. However, existing methods typically allocate compute uniformly across all queries, overlooking variation in query difficulty. To address this inefficiency, we formulate test-time compute allocation as a novel bandit learning problem and propose adaptive algorithms that estimate query difficulty on the fly and allocate compute accordingly. Compared to uniform allocation, our algorithms devote more compute to challenging queries while maintaining accuracy on easier ones. Among challenging queries, they further learn to prioritize solvable instances, reducing wasted computation on unsolvable ones. We theoretically prove that our algorithms achieve better compute efficiency than uniform allocation and empirically validate their effectiveness on math and code benchmarks. Specifically, our algorithms achieve up to an 11.10% absolute improvement (15.04% relative) on the MATH-500 dataset and up to a 7.41% absolute improvement (14.40% relative) on LiveCodeBench.
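The scheduling idea described above can be illustrated with a minimal UCB-style bandit sketch: each query is an arm, pulling an arm spends one unit of compute (e.g. one sampled solution attempt), and solved queries are retired so that the remaining budget flows toward queries that still look solvable. This is a hypothetical illustration under stated assumptions, not the paper's actual algorithm; the names `allocate_compute` and `attempt` are invented for this example.

```python
import math

def allocate_compute(queries, total_budget, attempt):
    """UCB-style bandit scheduler (hypothetical sketch).

    Each query is an arm; one pull spends one unit of compute on that
    query via attempt(query) -> bool. A solved query is retired, so
    later budget concentrates on queries whose upper-confidence-bound
    estimate of solvability is still high.
    """
    k = len(queries)
    pulls = [0] * k        # compute units spent per query
    wins = [0] * k         # successful attempts per query
    solved = set()
    for t in range(1, total_budget + 1):
        best, best_ucb = None, -1.0
        for i in range(k):
            if i in solved:
                continue
            if pulls[i] == 0:
                ucb = float("inf")   # try every query at least once
            else:
                ucb = wins[i] / pulls[i] + math.sqrt(2 * math.log(t) / pulls[i])
            if ucb > best_ucb:
                best, best_ucb = i, ucb
        if best is None:             # everything solved: stop early
            break
        pulls[best] += 1
        if attempt(queries[best]):   # reward = 1 iff this attempt solves it
            wins[best] += 1
            solved.add(best)
    return solved, pulls
```

Because a never-solved arm's empirical mean stays at zero while its confidence bonus shrinks with each pull, the sketch naturally tapers compute on queries that look unsolvable, mirroring the behavior the abstract describes.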