🤖 AI Summary
This work addresses computational inefficiency in large language model (LLM) inference, where redundant token generation wastes resources. We propose a latent-trajectory–based early path screening method that dynamically predicts the success probability of candidate reasoning paths during autoregressive decoding. Our core innovation is the Latent-Trajectory signal—a lightweight metric quantifying three aspects of hidden-state evolution: (i) initial-to-final representation divergence, (ii) cumulative intermediate variation, and (iii) convergence toward the final state. Unlike conventional confidence scores or majority voting, this signal enables robust path pruning and answer selection across multiple parallel sampling trajectories. Experiments on standard reasoning benchmarks demonstrate an average 2.6% absolute accuracy gain while reducing total token consumption by up to 70%, significantly improving both inference efficiency and effectiveness under test-time scaling.
📝 Abstract
Reasoning models improve their problem-solving ability through inference-time scaling, allocating more compute via longer token budgets. Identifying which reasoning traces are likely to succeed remains a key opportunity: reliably predicting productive paths can substantially reduce wasted computation and improve overall efficiency. We introduce Latent-Trajectory signals that characterize the temporal evolution of a model's internal representations during the generation of intermediate reasoning tokens. By measuring the overall change in latent representations between the start and end of reasoning, the change accumulated across intermediate steps, and the extent to which these changes advance toward the final state, we show that these signals predict solution accuracy more reliably than both cross-layer metrics and output-based confidence measures. When used to guide answer selection across multiple sampled generations, Latent-Trajectory signals make test-time scaling more effective and efficient than majority voting, reducing token usage by up to 70% while preserving accuracy and even improving it by 2.6% on average. Moreover, these predictive signals often emerge early in the reasoning trace, enabling early selection and allocation of compute to the most promising candidates. Our findings contribute not only practical strategies for inference-time efficiency, but also a deeper interpretability perspective on how reasoning processes are represented and differentiated in latent space.
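The three trajectory statistics described above can be sketched concretely. The snippet below is a minimal illustration, not the paper's exact formulation: it assumes a `(T, d)` array of per-token hidden states (e.g. from one decoder layer) and uses simple Euclidean norms, with the "advance toward the final state" signal approximated as the ratio of net displacement to total path length. All function and key names here are hypothetical.

```python
import numpy as np

def latent_trajectory_signals(hidden_states: np.ndarray) -> dict:
    """Illustrative trajectory statistics over a reasoning trace.

    hidden_states: (T, d) array with one hidden vector per generated
    reasoning token. The formulas are assumptions for exposition, not
    the paper's definitions.
    """
    h0, hT = hidden_states[0], hidden_states[-1]
    steps = np.diff(hidden_states, axis=0)  # (T-1, d) per-token deltas

    # (i) overall change between the start and end of reasoning
    net_change = np.linalg.norm(hT - h0)

    # (ii) change accumulated across intermediate steps (path length)
    cumulative_change = np.linalg.norm(steps, axis=1).sum()

    # (iii) how directly the intermediate steps advance toward the
    # final state: net displacement over path length (1.0 = straight line,
    # values near 0 = meandering without net progress)
    directness = net_change / max(cumulative_change, 1e-8)

    return {
        "net_change": float(net_change),
        "cumulative_change": float(cumulative_change),
        "directness": float(directness),
    }
```

Under this sketch, a candidate path whose hidden states drift without net progress would score low on directness, which is the kind of signal one could threshold for early pruning across parallel samples.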