🤖 AI Summary
This work addresses the inefficiency of reasoning methods that allocate sampling compute uniformly, which over-samples easy problems and under-explores hard ones. The authors propose Uncertainty-Aware Budget Allocation (UAB), a framework that reallocates a fixed sampling budget in two phases. It first estimates each question's difficulty from the average negative log-likelihood (ANLL) of a single generation, then applies a marginal-greedy algorithm that gives high-uncertainty questions more of the remaining budget. UAB derives its difficulty signal from output log-probabilities alone, with no auxiliary model or additional LLM call, and the greedy allocation exactly solves a concave coverage-maximization surrogate posed as an integer optimization problem. Evaluated on six language models (1.5B–27B parameters) and five reasoning benchmarks, UAB improves average accuracy by up to 3% over baselines and by up to 5% on individual benchmarks, with the largest gains in low-resource settings.
📝 Abstract
Sampling multiple responses improves language model reasoning, but uniform compute allocation is inefficient: easy questions are over-sampled while hard questions remain under-explored. We propose Uncertainty-Aware Budget Allocation (UAB), a concave integer optimization framework that reallocates a fixed sampling budget based on per-question uncertainty estimated at no additional inference cost. In Phase 1, every question receives one generation; its average negative log-likelihood (ANLL), extracted directly from output log-probabilities, serves as a difficulty signal while the generation contributes to the final vote. In Phase 2, the remaining budget is allocated by a marginal-greedy algorithm that solves a concave coverage-maximization surrogate exactly: uncertain questions receive more sampling budget while confident questions receive fewer additional samples. Evaluated on six open-weight and black-box models spanning 1.5B to 27B parameters and five reasoning benchmarks covering math, logic, and preference tasks, UAB outperforms baselines by up to +3% in average accuracy and up to +5% on individual benchmarks, with the largest gains in low-resource settings. It requires no auxiliary model or additional LLM call. Code is publicly available at https://github.com/manhitv/UAB.
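
The two phases can be illustrated with a minimal Python sketch. It is not the authors' implementation (see the linked repository for that). The function names `anll` and `allocate_budget` are hypothetical, and the surrogate is an assumption made for illustration: each question's per-sample success probability is taken to be p_i = exp(-ANLL_i), and coverage is 1 - (1 - p_i)^n_i, which is concave in the sample count n_i. The abstract does not state the exact surrogate, so the paper's mapping from ANLL to allocation may differ.

```python
import heapq
import math
from typing import List


def anll(token_logprobs: List[float]) -> float:
    """Average negative log-likelihood of one generation's tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / len(token_logprobs)


def allocate_budget(anlls: List[float], total_budget: int) -> List[int]:
    """Greedily split total_budget samples across questions.

    Assumed surrogate (illustrative, not taken from the paper):
    maximize sum_i [1 - (1 - p_i) ** n_i] with p_i = exp(-ANLL_i).
    """
    k = len(anlls)
    if total_budget < k:
        raise ValueError("budget must cover one Phase 1 sample per question")

    p = [math.exp(-a) for a in anlls]
    counts = [1] * k  # Phase 1: every question already has one generation

    def gain(i: int) -> float:
        # Marginal coverage gain from one more sample on question i.
        return p[i] * (1.0 - p[i]) ** counts[i]

    # Max-heap via negated gains. Because each question's gain shrinks
    # as its count grows, always taking the largest gain is optimal
    # for this separable concave objective.
    heap = [(-gain(i), i) for i in range(k)]
    heapq.heapify(heap)

    for _ in range(total_budget - k):  # Phase 2: spend the remainder
        _, i = heapq.heappop(heap)
        counts[i] += 1
        heapq.heappush(heap, (-gain(i), i))
    return counts


if __name__ == "__main__":
    # Hypothetical per-question ANLL values from Phase 1.
    scores = [0.05, 0.7, 0.3]
    print(allocate_budget(scores, total_budget=9))
```

Under this assumed surrogate, a question with a near-zero ANLL gains almost nothing from extra samples, so the remaining budget flows to less confident questions. The heap keeps each allocation step at O(log k) for k questions.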