LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing test-time scaling (TTS) strategies: they rely on handcrafted heuristics and leave much of the computation-allocation space unexplored. The authors propose AutoTTS, a framework that formulates width-depth TTS as a controller synthesis problem over pre-collected reasoning trajectories and probe signals. β-parameterization keeps the search tractable, and fine-grained execution-trace feedback helps the search agent diagnose why a candidate policy fails, so policies can be evaluated without repeatedly invoking large language models. The discovered policies improve the accuracy-cost tradeoff over strong human-designed baselines on mathematical reasoning benchmarks and generalize to held-out benchmarks and model scales. The entire discovery process costs only $39.90 and takes 160 minutes.

📝 Abstract

Test-time scaling (TTS) has become an effective approach for improving large language model performance by allocating additional computation during inference. However, existing TTS strategies are largely hand-crafted: researchers manually design reasoning patterns and tune heuristics by intuition, leaving much of the computation-allocation space unexplored. We propose an environment-driven framework, AutoTTS, that changes what researchers design: from individual TTS heuristics to environments where TTS strategies can be discovered automatically. The key to AutoTTS lies in environment construction: the discovery environment must make the control space tractable and provide cheap, frequent feedback for TTS search. As a concrete instantiation, we formulate width-depth TTS as controller synthesis over pre-collected reasoning trajectories and probe signals, where controllers decide when to branch, continue, probe, prune, or stop and can be evaluated cheaply without repeated LLM calls. We further introduce beta parameterization to make the search tractable and fine-grained execution trace feedback to improve discovery efficiency by helping the agent diagnose why a TTS program fails. Experiments on mathematical reasoning benchmarks show that the discovered strategies improve the overall accuracy-cost tradeoff over strong manually designed baselines. The discovered strategies generalize to held-out benchmarks and model scales, while the entire discovery costs only $39.9 and 160 minutes. Our data and code will be open-sourced at https://github.com/zhengkid/AutoTTS.
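
To make the idea of evaluating a controller over pre-collected trajectories concrete, here is a minimal sketch in Python. It is not the AutoTTS implementation: the `Trajectory` layout, the `run_controller` and `evaluate` functions, the agreement-threshold stopping rule, and the toy data are all assumptions made for illustration. The sketch covers only the branch, continue, probe, and stop decisions (pruning is omitted) and shows that accuracy and cost can be computed by replaying cached probe answers, with no LLM calls.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Trajectory:
    """One cached reasoning branch (hypothetical layout).

    probes[t] is the answer this branch would give if stopped after step t + 1.
    """
    probes: list
    tokens_per_step: int = 100


def run_controller(trajs, width, threshold, max_depth):
    """Replay a simple width-depth controller; returns (answer, token_cost)."""
    active = trajs[:width]  # branch: open up to `width` cached branches
    cost = 0
    answer = None
    for depth in range(max_depth):
        # continue: keep only branches that still have a cached step
        active = [t for t in active if depth < len(t.probes)]
        if not active:
            break
        cost += sum(t.tokens_per_step for t in active)
        # probe: read each branch's cached intermediate answer and vote
        votes = Counter(t.probes[depth] for t in active)
        answer, count = votes.most_common(1)[0]
        # stop: enough branches agree on the leading answer
        if count / len(active) >= threshold:
            break
    return answer, cost


def evaluate(dataset, width, threshold, max_depth):
    """Return (accuracy, mean token cost) over (trajectories, gold) pairs."""
    correct = 0
    total_cost = 0
    for trajs, gold in dataset:
        answer, cost = run_controller(trajs, width, threshold, max_depth)
        correct += int(answer == gold)
        total_cost += cost
    n = len(dataset)
    return correct / n, total_cost / n


# Toy cached data: two questions, three branches each (made up for this sketch).
toy = [
    ([Trajectory(["7", "12", "12"]),
      Trajectory(["12", "12", "12"]),
      Trajectory(["5", "12", "12"])], "12"),
    ([Trajectory(["3", "3"]),
      Trajectory(["3", "3"]),
      Trajectory(["4", "3"])], "3"),
]

for width, threshold in [(3, 1.0), (3, 0.6), (1, 1.0)]:
    acc, cost = evaluate(toy, width, threshold, max_depth=3)
    print(f"width={width} threshold={threshold}: accuracy={acc}, mean cost={cost}")
```

On this toy data, lowering the agreement threshold from 1.0 to 0.6 keeps accuracy at 1.0 while reducing mean cost from 600 to 450 tokens, and a single branch costs 100 tokens but answers only one of the two questions correctly. Because every evaluation is a cheap replay, a search procedure can compare many candidate controllers along the accuracy-cost tradeoff; how AutoTTS actually searches this space is described in the paper, not in this sketch.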

Problem

Research questions and friction points this paper is trying to address.

test-time scaling

large language models

heuristic design

computation allocation

reasoning strategies

Innovation

Methods, ideas, or system contributions that make the work stand out.

Test-Time Scaling

Automated Strategy Discovery

Controller Synthesis