🤖 AI Summary
Evaluating multi-step reasoning in large language models (LLMs) lacks interpretability because chain-of-thought (CoT) processes are opaque.
Method: We propose the first geometric modeling framework grounded in Hamiltonian mechanics: CoTs are mapped to trajectories in an embedding-induced phase space, where kinetic energy quantifies reasoning progress and potential energy encodes problem relevance; their sum—the Hamiltonian energy—serves as a principled metric of reasoning quality.
Results: Empirical analysis across multiple multi-hop question-answering benchmarks reveals that correct CoTs exhibit significantly lower and more stable Hamiltonian energy than incorrect ones, enabling geometric separability between valid and invalid reasoning paths. This yields a physically inspired, interpretable paradigm for LLM reasoning diagnostics—uncovering generalizable geometric discriminative patterns while offering quantitative, energy-based assessment of reasoning fidelity without requiring ground-truth step-level annotations.
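The energy decomposition described above can be sketched numerically. The snippet below is a minimal illustration, not the paper's implementation: it assumes kinetic energy is the squared norm of consecutive step-embedding differences (reasoning "velocity") and potential energy is the squared distance of each step to the question embedding (relevance); the paper's exact definitions and normalizations may differ.

```python
import numpy as np

def hamiltonian_energy(step_embeddings, question_embedding):
    """Hypothetical sketch of a per-step Hamiltonian energy for a CoT.

    step_embeddings: (T, d) array, one embedding per reasoning step.
    question_embedding: (d,) array, embedding of the question.
    Returns the mean and std of the per-step energy along the trajectory.
    """
    X = np.asarray(step_embeddings, dtype=float)
    q = np.asarray(question_embedding, dtype=float)

    # Kinetic term: how far the chain moves between consecutive steps.
    velocities = np.diff(X, axis=0)                    # (T-1, d)
    kinetic = 0.5 * np.sum(velocities ** 2, axis=1)

    # Potential term: distance of each step (after the first) to the question.
    potential = 0.5 * np.sum((X[1:] - q) ** 2, axis=1)

    energy = kinetic + potential                       # per-step Hamiltonian
    return energy.mean(), energy.std()
```

Under these assumed definitions, a chain that drifts away from the question accrues a larger potential term, so its mean energy exceeds that of a chain converging toward the question — the qualitative pattern the Results paragraph reports for incorrect versus correct CoTs.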
📝 Abstract
This paper proposes a novel approach to analyzing multi-hop reasoning in language models through Hamiltonian mechanics. We map reasoning chains in embedding spaces to Hamiltonian systems, defining an energy function that balances reasoning progression (kinetic energy) against question relevance (potential energy). Analyzing reasoning chains from a question-answering dataset reveals that valid reasoning exhibits lower Hamiltonian energy values, representing an optimal trade-off between information gathering and targeted answering. While our framework offers rich visualization and quantification methods, the claimed ability to "steer" or "improve" reasoning algorithms requires more rigorous empirical validation, as the connection between physical systems and reasoning remains largely metaphorical. Nevertheless, our analysis reveals consistent geometric patterns distinguishing valid reasoning, suggesting this physics-inspired approach offers promising diagnostic tools and new perspectives on reasoning processes in large language models.