HERMES: Towards Efficient and Verifiable Mathematical Reasoning in LLMs

📅 2025-11-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM-based mathematical reasoning agents struggle to balance the flexibility of informal reasoning with the rigor of formal proof verification: informal reasoning risks latent errors, while formal methods constrain exploratory problem solving. Method: Hermes is the first tool-augmented agent to interleave informal reasoning with formal verification in the Lean theorem prover. Its core innovations are (1) an intermediate formal-checking mechanism that detects and corrects reasoning drift as the proof unfolds, and (2) a state-aware memory module that preserves semantic coherence across long proof chains. Hermes unifies exploration and correctness via LLM-driven tool invocation, stepwise verification, and persistent memory management. Results: Hermes reliably improves the accuracy of base models across four mathematical reasoning benchmarks; on difficult datasets such as AIME'25 it achieves up to a 67% accuracy improvement while using 80% fewer total inference FLOPs, improving both reliability and computational efficiency.
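The interleaved loop described in the summary might be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `propose_step`, `formal_check`, and `ProofMemory` are hypothetical stand-ins for the LLM and the Lean compiler.

```python
# Sketch of an interleaved informal/formal reasoning loop.
# `propose_step` stands in for the LLM, `formal_check` for the Lean
# compiler, and `ProofMemory` for the state-aware memory module.
# All three names are illustrative, not from the Hermes codebase.
from dataclasses import dataclass, field


@dataclass
class ProofMemory:
    """State-aware memory: only verified steps persist across the chain."""
    steps: list = field(default_factory=list)

    def add(self, informal: str, formal: str) -> None:
        self.steps.append((informal, formal))


def formal_check(claim: str) -> bool:
    """Stand-in for the Lean compiler: here it just evaluates
    simple arithmetic identities such as '2 + 3 = 5'."""
    lhs, rhs = claim.split("=")
    return eval(lhs) == eval(rhs)


def propose_step(k: int) -> tuple:
    """Stand-in for the LLM: emits an informal step together with
    a formal claim for the checker (partial sums 1 + ... + k)."""
    total = k * (k + 1) // 2
    informal = f"the partial sum after {k} terms is {total}"
    formal = "+".join(str(i) for i in range(1, k + 1)) + f" = {total}"
    return informal, formal


def prove(n_steps: int) -> ProofMemory:
    memory = ProofMemory()
    for k in range(1, n_steps + 1):
        informal, formal = propose_step(k)
        # Intermediate formal checking: a drifting step is rejected
        # immediately instead of silently propagating downstream.
        if not formal_check(formal):
            raise ValueError(f"reasoning drift at step {k}: {formal}")
        memory.add(informal, formal)
    return memory


mem = prove(4)
print(len(mem.steps))  # 4 verified steps
```

In the real system each `formal_check` call would compile a Lean proof obligation, and the memory module would carry richer proof state than a flat list, but the control flow (propose, verify, persist) is the same shape.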

📝 Abstract
Informal mathematics has been central to modern large language model (LLM) reasoning, offering flexibility and enabling efficient construction of arguments. However, purely informal reasoning is prone to logical gaps and subtle errors that are difficult to detect and correct. In contrast, formal theorem proving provides rigorous, verifiable mathematical reasoning, where each inference step is checked by a trusted compiler in systems such as Lean, but lacks the exploratory freedom of informal problem solving. This mismatch leaves current LLM-based math agents without a principled way to combine the strengths of both paradigms. In this work, we introduce Hermes, the first tool-assisted agent that explicitly interleaves informal reasoning with formally verified proof steps in Lean. The framework performs intermediate formal checking to prevent reasoning drift and employs a memory module that maintains proof continuity across long, multi-step reasoning chains, enabling both exploration and verification within a single workflow. We evaluate Hermes on four challenging mathematical reasoning benchmarks using LLMs of varying parameter scales, from small models to state-of-the-art systems. Across all settings, Hermes reliably improves the reasoning accuracy of base models while substantially reducing token usage and computational cost compared to reward-based approaches. On difficult datasets such as AIME'25, Hermes achieves up to a 67% accuracy improvement while using 80% fewer total inference FLOPs. The implementation and codebase are publicly available at https://github.com/aziksh-ospanov/HERMES.
Problem

Research questions and friction points this paper is trying to address.

Bridging informal reasoning flexibility with formal verification rigor
Preventing logical gaps in mathematical proofs through intermediate checking
Reducing computational costs while improving reasoning accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interleaves informal reasoning with formal verification
Uses intermediate formal checking to prevent errors
Employs memory module for multi-step proof continuity
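
The intermediate-checking idea can be illustrated with a tiny Lean 4 fragment (illustrative only, not from the Hermes codebase): each informal step is mirrored by a lemma the Lean compiler must accept, so a step that drifts from the truth surfaces as a compile error rather than a latent gap.

```lean
-- Each informal step is mirrored by a lemma Lean must accept.
theorem step1 : 1 + 2 = 3 := by decide
theorem step2 : 1 + 2 + 3 = 6 := by decide
-- A drifting step such as `1 + 2 + 3 = 7` would fail to compile,
-- flagging the error at the exact step where it was introduced.
```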