Scaling Test-Time Compute for Agentic Coding

📅 2026-04-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing test-time scaling approaches struggle to manage the complex trajectories produced by long-horizon coding agents and lack efficient mechanisms for representing and reusing experience from past attempts. This work proposes a test-time scaling framework tailored to coding agents, built on structured trajectory summaries that preserve key hypotheses, progress, and failure modes. On top of these summaries, Recursive Tournament Voting (RTV) provides parallel scaling, and an adaptation of Parallel-Distill-Refine (PDR) provides sequential scaling. Evaluated on SWE-Bench Verified and Terminal-Bench v2.0, the approach consistently improves frontier coding agents; for instance, Claude-4.5-Opus improves from 70.9% to 77.6% on SWE-Bench Verified (mini-SWE-agent) and from 46.9% to 59.1% on Terminal-Bench v2.0 (Terminus 1).
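To make the idea of a structured trajectory summary concrete, here is a minimal sketch of what such a record might look like. The class name `RolloutSummary`, its field names, and the text layout produced by `render()` are all assumptions chosen for illustration; the paper's actual schema is not given on this page.

```python
from dataclasses import dataclass, field


@dataclass
class RolloutSummary:
    """Hypothetical compact record of one agent rollout.

    The three list fields mirror the categories named in the summary above
    (hypotheses, progress, failure modes); the exact schema is an assumption.
    """
    rollout_id: str
    hypotheses: list[str] = field(default_factory=list)
    progress: list[str] = field(default_factory=list)
    failure_modes: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Serialize the summary as short plain text, e.g. for a prompt."""
        sections = [
            ("Hypotheses", self.hypotheses),
            ("Progress", self.progress),
            ("Failure modes", self.failure_modes),
        ]
        lines = [f"Rollout {self.rollout_id}"]
        for title, items in sections:
            lines.append(f"{title}:")
            # Mark empty sections explicitly so a reader can tell that
            # nothing was recorded, rather than that the section was lost.
            lines.extend(f"  - {item}" for item in items or ["(none recorded)"])
        return "\n".join(lines)
```

A record like this drops the raw action and observation trace entirely, so its size depends on how many findings the rollout produced rather than on how many steps it took.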

📝 Abstract

Test-time scaling has become a powerful way to improve large language models. However, existing methods are best suited to short, bounded outputs that can be directly compared, ranked, or refined. Long-horizon coding agents violate this premise: each attempt produces an extended trajectory of actions, observations, errors, and partial progress. In this setting, the main challenge is no longer generating more attempts, but representing prior experience in a form that can be effectively selected from and reused. We propose a test-time scaling framework for agentic coding based on compact representations of rollout trajectories. Our framework converts each rollout into a structured summary that preserves its salient hypotheses, progress, and failure modes while discarding low-signal trace details. This representation enables two complementary forms of inference-time scaling. For parallel scaling, we introduce Recursive Tournament Voting (RTV), which recursively narrows a population of rollout summaries through small-group comparisons. For sequential scaling, we adapt Parallel-Distill-Refine (PDR) to the agentic setting by conditioning new rollouts on summaries distilled from prior attempts. Our method consistently improves the performance of frontier coding agents across SWE-Bench Verified and Terminal-Bench v2.0. For example, with our method, Claude-4.5-Opus improves from 70.9% to 77.6% on SWE-Bench Verified (mini-SWE-agent) and from 46.9% to 59.1% on Terminal-Bench v2.0 (Terminus 1). Our results suggest that test-time scaling for long-horizon agents is fundamentally a problem of representation, selection, and reuse.
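The two scaling modes described in the abstract can be sketched roughly as follows. This is an illustrative reading, not the authors' implementation: the function names, the `judge`, `run_agent`, and `summarize` callables, the group size, the random grouping, and the choice to finish the sequential loop with a tournament are all assumptions. In practice the judge and the agent would be model calls; here they are plain callables so the control flow can be run on its own.

```python
import random
from typing import Callable, Sequence

# Hypothetical judge: given a small group of summaries, return the index
# (within that group) of the one it prefers.
Judge = Callable[[Sequence[str]], int]


def recursive_tournament_vote(
    summaries: Sequence[str], judge: Judge, group_size: int = 4, seed: int = 0
) -> int:
    """Narrow a population of summaries to one winner; return its index."""
    if not summaries:
        raise ValueError("need at least one summary")
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so each round shrinks the pool")
    rng = random.Random(seed)
    pool = list(range(len(summaries)))  # indices of summaries still in play
    while len(pool) > 1:
        rng.shuffle(pool)  # assumption: groups are formed at random
        groups = [pool[i:i + group_size] for i in range(0, len(pool), group_size)]
        # Each group sends exactly one winner to the next round.
        pool = [g[judge([summaries[i] for i in g])] for g in groups]
    return pool[0]


def distill_refine(
    task: str,
    run_agent: Callable[[str, Sequence[str]], str],
    summarize: Callable[[str], str],
    judge: Judge,
    rounds: int = 2,
    width: int = 4,
) -> str:
    """Run several rounds of rollouts, each conditioned on the previous
    round's summaries, then pick one final rollout by tournament."""
    if rounds < 1 or width < 1:
        raise ValueError("rounds and width must be positive")
    prior: list[str] = []  # the first round sees no prior experience
    for _ in range(rounds):
        rollouts = [run_agent(task, prior) for _ in range(width)]
        summaries = [summarize(r) for r in rollouts]
        prior = summaries  # distilled attempts condition the next round
    return rollouts[recursive_tournament_vote(summaries, judge)]
```

Because every comparison involves only `group_size` summaries, no single judge call has to read the whole population, and both functions operate on summaries rather than raw trajectories.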

Problem

Research questions and friction points this paper is trying to address.

test-time scaling

agentic coding

long-horizon agents

rollout trajectories

inference-time scaling

Innovation

Methods, ideas, or system contributions that make the work stand out.

test-time scaling

agentic coding

trajectory summarization