AI Summary
This work addresses a limitation of existing test-time scaling methods: they treat all reasoning trajectories or tokens uniformly, so they cannot account for variations in trajectory quality or for localized logical errors. To overcome this, we propose Chronos, a lightweight, plug-and-play temporal reasoning scorer that, for the first time, models reasoning chains as time series. Chronos learns features from the token-probability sequence of each trajectory, assigns the trajectory a quality score, and uses these scores as weights in voting. On HMMT25 with Qwen3-4B-Thinking-2507, Chronos@128 achieves relative improvements of 34.21% over Pass@1 and 22.70% over Maj@128, while incurring minimal computational overhead.
Abstract
Test-Time Scaling (TTS) has emerged as an effective paradigm for improving the reasoning performance of large language models (LLMs). However, existing methods, most notably majority voting and heuristic token-level scoring, treat all reasoning traces or tokens equally, which leaves them vulnerable to substantial variations in trajectory quality and to localized logical failures. In this work, we introduce \textbf{Chronos}, a lightweight and plug-and-play chronological reasoning scorer that models each trajectory as a time series. Specifically, Chronos learns to capture trajectory features from token probabilities, assigns quality scores accordingly, and employs a weighted voting mechanism. Extensive evaluations on both in-domain and out-of-domain benchmarks demonstrate that Chronos consistently delivers substantial gains across a variety of models, with negligible computational overhead. Notably, Chronos@128 achieves relative improvements of 34.21\% over Pass@1 and 22.70\% over Maj@128 on HMMT25 using Qwen3-4B-Thinking-2507, highlighting its effectiveness.
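To make the difference between majority voting and score-weighted voting concrete, the following is a minimal sketch of the aggregation step only. It assumes each sampled trajectory has already been reduced to a final answer and a scalar quality score. The learned scorer that produces those scores is not reproduced here, and the function names `weighted_vote` and `majority_vote`, as well as the example scores, are hypothetical.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def weighted_vote(answers: Sequence[Hashable], scores: Sequence[float]) -> Hashable:
    """Return the answer whose trajectories have the largest total score.

    `scores` are assumed to come from some trajectory-quality scorer;
    how they are computed is outside the scope of this sketch.
    """
    if len(answers) != len(scores):
        raise ValueError("answers and scores must have the same length")
    if not answers:
        raise ValueError("need at least one trajectory")

    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score  # accumulate score mass per distinct answer
    # Ties resolve to the answer that appeared first among the trajectories.
    return max(totals, key=totals.get)


def majority_vote(answers: Sequence[Hashable]) -> Hashable:
    """Plain majority voting is the special case where every score is 1."""
    return weighted_vote(answers, [1.0] * len(answers))


# Illustrative inputs: five trajectories, two distinct final answers.
answers = ["42", "42", "17", "17", "17"]
scores = [0.9, 0.8, 0.2, 0.3, 0.1]  # made-up quality scores

majority_answer = majority_vote(answers)          # "17" wins on count (3 vs 2)
weighted_answer = weighted_vote(answers, scores)  # "42" wins on score (1.7 vs 0.6)
```

In this toy case the two schemes disagree: three low-scored trajectories outvote two high-scored ones under majority voting, whereas weighting by score lets the higher-quality trajectories decide the answer.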