Probing the Trajectories of Reasoning Traces in Large Language Models

📅 2026-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the dynamic interplay between accuracy and decision certainty during large language model (LLM) reasoning, and whether intermediate reasoning trajectories contain substantive answer-relevant information beyond mere length or stylistic cues. To this end, the authors propose a “trajectory probing” protocol: reasoning trajectories are truncated at token-level percentiles, and the resulting fragments are reinjected into the model to assess their influence on the answer distribution via next-token probability analysis. Extensive experiments across multiple benchmarks on Qwen3 and gpt-oss model families demonstrate that reasoning content itself—not context length or style—drives performance gains; stronger models can recover from erroneous trajectories, and both accuracy and certainty consistently increase with the proportion of reasoning steps retained. This approach offers a novel diagnostic tool for reliable LLM deployment and supports safer, more efficient inference strategies.

📝 Abstract
Large language models (LLMs) increasingly solve difficult problems by producing "reasoning traces" before emitting a final response. However, it remains unclear how accuracy and decision commitment evolve along a reasoning trajectory, and whether intermediate trace segments provide answer-relevant information beyond generic length or stylistic effects. Here, we propose a protocol to systematically probe the trajectories of reasoning traces in LLMs by 1) generating a model's reasoning trace, 2) truncating it at fixed token-percentiles, and 3) injecting each partial trace back into the model (or a different model) to measure the induced distribution over answer choices via next-token probabilities. We apply this protocol to the open-source Qwen3-4B/-8B/-14B and gpt-oss-20b/-120b models across the multiple-choice GPQA Diamond and MMLU-Pro benchmarks. We find that accuracy and decision commitment consistently increase as the percentage of provided reasoning tokens grows. These gains are primarily driven by relevant content in the model generation rather than context length or generic "reasoning style" effects. Stronger models often backtrack successfully from incorrect partial traces, but immediate answers frequently remain anchored in the weaker model's incorrect response. More broadly, we show that trajectory probing provides diagnostics for efficient and safer deployment of reasoning models as the measurements can inform practical trace-handling and monitoring policies that improve reliability without assuming intermediate tokens are inherently faithful explanations.
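The three-step protocol in the abstract (generate a trace, truncate it at fixed token percentiles, reinject each partial trace and score the answer choices) can be sketched as follows. All names, the prompt template, and the `answer_logprobs` scoring interface are illustrative assumptions, not the authors' code; a toy stand-in replaces a real model call.

```python
import math

def truncate_at_percentiles(trace_tokens, percentiles):
    """Step 2: cut a reasoning trace at fixed token-level percentiles."""
    n = len(trace_tokens)
    return {p: trace_tokens[: round(n * p / 100)] for p in percentiles}

def probe_trajectory(question, trace_tokens, choices, answer_logprobs,
                     percentiles=(0, 25, 50, 75, 100)):
    """Step 3: reinject each partial trace and read off the induced
    distribution over answer choices from next-token log-scores."""
    results = {}
    for p, partial in truncate_at_percentiles(trace_tokens, percentiles).items():
        prompt = f"{question}\n<think>{' '.join(partial)}</think>\nAnswer:"
        logps = answer_logprobs(prompt, choices)  # assumed model scoring call
        z = sum(math.exp(lp) for lp in logps.values())  # softmax-normalize
        results[p] = {c: math.exp(lp) / z for c, lp in logps.items()}
    return results

# Toy stand-in for a real model: commitment to choice "B" grows with the
# amount of reasoning shown, mimicking the paper's qualitative finding.
def toy_logprobs(prompt, choices):
    k = prompt.count("step")  # crude proxy for visible reasoning tokens
    return {c: (float(k) if c == "B" else 0.0) for c in choices}

dist = probe_trajectory("Q: pick one.", ["step"] * 8, ["A", "B"], toy_logprobs)
```

With a real model, `answer_logprobs` would return the next-token log-probabilities of each choice label; plotting `dist[p]` against the percentile `p` then traces how accuracy and commitment evolve along the trajectory.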
Problem

Research questions and friction points this paper is trying to address.

reasoning traces
large language models
accuracy evolution
decision commitment
trajectory probing
Innovation

Methods, ideas, or system contributions that make the work stand out.

reasoning traces
trajectory probing
decision commitment
answer relevance
model diagnostics
Marthe Ballon
Data Analytics Lab, Vrije Universiteit Brussel, Pleinlaan 5, 1050 Brussels, Belgium; imec-SMIT, Vrije Universiteit Brussel, Pleinlaan 9, 1050 Brussels, Belgium
Brecht Verbeken
Data Analytics Lab, Vrije Universiteit Brussel, Pleinlaan 5, 1050 Brussels, Belgium; imec-SMIT, Vrije Universiteit Brussel, Pleinlaan 9, 1050 Brussels, Belgium
Vincent Ginis
Vrije Universiteit Brussel / Harvard University
Physics | Machine Learning
A. Algaba
Data Analytics Lab, Vrije Universiteit Brussel, Pleinlaan 5, 1050 Brussels, Belgium; imec-SMIT, Vrije Universiteit Brussel, Pleinlaan 9, 1050 Brussels, Belgium