Timely Machine: Awareness of Time Makes Test-Time Scaling Agentic

📅 2026-01-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of conventional test-time scaling, which measures compute by generation length. In tool-intensive scenarios, tool latency decouples inference time from output length, so length-based budgets no longer reflect how long a model actually takes, and models cannot adapt their strategy to a time budget. The authors propose the Timely Machine framework, which redefines test-time scaling in terms of wall-clock time and introduces a time-aware agentic reasoning paradigm. Training combines supervised fine-tuning for cold-start initialization with Timely-RL, a tailored reinforcement learning approach, so that the model learns to allocate its interactions across varying tool latencies and time constraints. On the newly constructed Timely-Eval benchmark, Timely-RL improves time budget awareness and temporal planning. Varying tool latency also shows that smaller models excel under low latency by making more interactions, while larger models outperform under high latency through higher interaction quality.

📝 Abstract

As large language models (LLMs) increasingly tackle complex reasoning tasks, test-time scaling has become critical for enhancing capabilities. However, in agentic scenarios with frequent tool calls, the traditional generation-length-based definition breaks down: tool latency decouples inference time from generation length. We propose Timely Machine, redefining test-time as wall-clock time, where models dynamically adjust strategies based on time budgets. We introduce Timely-Eval, a benchmark spanning high-frequency tool calls, low-frequency tool calls, and time-constrained reasoning. By varying tool latency, we find smaller models excel with fast feedback through more interactions, while larger models dominate high-latency settings via superior interaction quality. Moreover, existing models fail to adapt reasoning to time budgets. We propose Timely-RL to address this gap. After cold-start supervised fine-tuning, we use reinforcement learning to enhance temporal planning. Timely-RL improves time budget awareness and consistently boosts performance across Timely-Eval. We hope our work offers a new perspective on test-time scaling for the agentic era.
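To make the wall-clock framing concrete, the following is a minimal sketch, not the authors' implementation. It assumes a simplified agent in which every round costs a fixed thinking time plus a fixed tool latency, and it uses a simulated clock so the result is deterministic. The names `SimulatedClock` and `run_with_time_budget` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class SimulatedClock:
    """Hypothetical stand-in for wall-clock time, so the sketch is deterministic."""
    now: float = 0.0

    def advance(self, seconds: float) -> None:
        self.now += seconds


def run_with_time_budget(budget_s, tool_latency_s, think_s, clock=None):
    """Count how many think-then-call-tool rounds fit in a wall-clock budget.

    Assumption: each round costs `think_s` seconds of generation plus
    `tool_latency_s` seconds waiting on the tool. A round is only started
    if it can finish before the budget runs out.
    """
    if clock is None:
        clock = SimulatedClock()
    start = clock.now
    rounds = 0
    step_cost = think_s + tool_latency_s
    while (clock.now - start) + step_cost <= budget_s:
        clock.advance(think_s)          # model generates its next action
        clock.advance(tool_latency_s)   # tool call returns after its latency
        rounds += 1
    return rounds, clock.now - start


# Same 60-second budget and same thinking time, different tool latency.
fast_rounds, fast_elapsed = run_with_time_budget(60.0, 1.0, 2.0)
slow_rounds, slow_elapsed = run_with_time_budget(60.0, 20.0, 2.0)
print(fast_rounds, slow_rounds)
```

Under these assumptions the generated text per round is identical in both runs, yet the number of rounds that fit in the budget differs by an order of magnitude. This is the sense in which tool latency decouples inference time from generation length: with slow tools only a few interactions are affordable, so the quality of each one matters more.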

Problem

Research questions and friction points this paper is trying to address.

test-time scaling

agentic reasoning

tool latency

time budget awareness

wall-clock time

Innovation

Methods, ideas, or system contributions that make the work stand out.

test-time scaling

wall-clock time

tool-augmented reasoning