Chronos: Learning Temporal Dynamics of Reasoning Chains for Test-Time Scaling

📅 2026-02-01
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses a limitation of existing test-time scaling methods: they treat all reasoning trajectories or tokens uniformly, so they cannot account for variations in trajectory quality or for localized logical errors. To overcome this, the authors propose Chronos, a lightweight, plug-and-play chronological reasoning scorer that, according to the authors, is the first to model reasoning chains as time series. Chronos learns features from each trajectory's sequence of token probabilities, assigns the trajectory a quality score, and uses those scores in a weighted vote over the final answers. On HMMT25 with Qwen3-4B-Thinking-2507, Chronos@128 achieves relative improvements of 34.21% over Pass@1 and 22.70% over Maj@128, with negligible computational overhead.

📝 Abstract
Test-Time Scaling (TTS) has emerged as an effective paradigm for improving the reasoning performance of large language models (LLMs). However, existing methods, most notably majority voting and heuristic token-level scoring, treat all reasoning traces or tokens equally, which leaves them susceptible to substantial variations in trajectory quality and to localized logical failures. In this work, we introduce Chronos, a lightweight, plug-and-play chronological reasoning scorer that models each trajectory as a time series. Specifically, Chronos learns to capture trajectory features of token probabilities, assigns quality scores accordingly, and employs a weighted voting mechanism. Extensive evaluations on both in-domain and out-of-domain benchmarks demonstrate that Chronos consistently delivers substantial gains across a variety of models, with negligible computational overhead. Notably, Chronos@128 achieves relative improvements of 34.21% over Pass@1 and 22.70% over Maj@128 on HMMT25 using Qwen3-4B-Thinking-2507, highlighting its effectiveness.
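
The difference between majority voting and score-weighted voting can be illustrated with a small Python sketch. This is not the authors' implementation: `placeholder_score` is a hypothetical stand-in (the geometric mean of token probabilities) for the learned Chronos scorer, and the answers and probabilities below are made-up illustration values, not results from the paper.

```python
import math
from collections import defaultdict
from typing import Dict, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Baseline: every trajectory counts equally."""
    counts: Dict[str, float] = defaultdict(float)
    for answer in answers:
        counts[answer] += 1.0
    return max(counts, key=counts.get)


def weighted_vote(answers: Sequence[str], scores: Sequence[float]) -> str:
    """Each trajectory contributes its quality score to its final answer."""
    if len(answers) != len(scores):
        raise ValueError("answers and scores must have the same length")
    totals: Dict[str, float] = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score
    return max(totals, key=totals.get)


def placeholder_score(token_probs: Sequence[float]) -> float:
    """Hypothetical scorer (assumption): geometric mean of token probabilities.

    Chronos instead learns a scorer over the probability time series;
    this function only fills that slot for the example.
    """
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    mean_log_prob = sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_log_prob)


# Made-up example: two low-confidence trajectories agree on "12",
# one high-confidence trajectory answers "7".
answers = ["12", "12", "7"]
token_probs = [
    [0.4, 0.3, 0.5],
    [0.5, 0.4, 0.3],
    [0.9, 0.95, 0.9],
]
scores = [placeholder_score(probs) for probs in token_probs]

print(majority_vote(answers))          # unweighted winner
print(weighted_vote(answers, scores))  # score-weighted winner
```

In this toy case the two procedures disagree: the unweighted vote follows the two agreeing trajectories, while the weighted vote follows the single trajectory with the higher score. Whether such a shift helps in practice depends entirely on how well the scorer tracks trajectory quality, which is the part Chronos learns.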
Problem

Research questions and friction points this paper is trying to address.

Test-Time Scaling
reasoning chains
trajectory quality
logical failures
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Test-Time Scaling
Chronological Reasoning
Time Series Modeling
Weighted Voting
Trajectory Quality Scoring