Learning to Reason Efficiently with A* Post-Training

📅 2026-05-23
🤖 AI Summary
This work addresses a common failure of large language models in natural language deductive reasoning: they often produce erroneous or redundant steps and struggle to balance correctness with efficiency. The authors frame the task as a path-finding problem and bring A* search into the post-training of language models for reasoning. Their post-training strategy combines supervised fine-tuning on A* execution traces with process-based reinforcement learning guided by A*-informed reward models. Llama-3.2 models (1B–3B parameters) go from near-zero accuracy to outperforming the much larger DeepSeek-V3.2. The analysis also shows a trade-off (simple correctness rewards maximize accuracy, while A*-informed signals balance accuracy and efficiency) and finds that, on larger search spaces, models trained with imperfect heuristics achieve higher accuracy.
📝 Abstract
Many applications of large language models (LLMs) require deductive reasoning, yet models frequently produce incorrect or redundant inference steps. We frame natural language inference as a search problem where the final answer is the valid proof itself, requiring a reasoning procedure in which intermediate inferences are correct. Specifically, we investigate whether LLMs can learn to generate correct and efficient proofs with guidance from A* search -- an algorithm that guarantees an optimally efficient path to a goal. We explore two training techniques: supervised fine-tuning on execution traces from A* and reinforcement learning with A*-informed process reward models. Empirically, we find that Llama-3.2 models in the 1B--3B range benefit substantially from A* post-training, going from near-zero accuracy to outperforming DeepSeek-V3.2 -- a much larger model. Our analysis uncovers a trade-off: while simple correctness rewards maximize accuracy, A*-informed signals strike a balance between accuracy and efficiency. Furthermore, we find that on larger search spaces, models trained with imperfect heuristics exhibit superior accuracy. Our results demonstrate a promising direction towards reasoning guided by principles derived from classical search algorithms.
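To make the "proof generation as search" framing concrete, here is a minimal sketch of A* over a toy forward-chaining problem. It is not the authors' implementation. The rule format, the unit step cost, the `goal_missing` heuristic, and all names are assumptions chosen for illustration. A state is the set of facts derived so far, each rule application costs 1, and A* expands states in order of f = g + h. The heuristic used here never overestimates the remaining steps (it is admissible), which is the condition under which A* returns a shortest proof.

```python
import heapq
from itertools import count


def a_star_proof(facts, rules, goal, heuristic):
    """Return a shortest list of (premises, conclusion) steps deriving `goal`, or None."""
    start = frozenset(facts)
    tie = count()  # tie-breaker so the heap never has to compare states
    # Heap entries: (f = g + h, tie, g, state, proof so far)
    frontier = [(heuristic(start, goal), next(tie), 0, start, [])]
    best_g = {start: 0}  # cheapest known cost to reach each state
    while frontier:
        _, _, g, state, proof = heapq.heappop(frontier)
        if goal in state:
            return proof
        if g > best_g.get(state, float("inf")):
            continue  # stale entry: a cheaper route to this state was found
        for premises, conclusion in rules:
            # A rule fires only if all premises hold and it adds something new.
            if premises <= state and conclusion not in state:
                nxt = state | {conclusion}
                new_g = g + 1  # assumed unit cost per inference step
                if new_g < best_g.get(nxt, float("inf")):
                    best_g[nxt] = new_g
                    step = (sorted(premises), conclusion)
                    f = new_g + heuristic(nxt, goal)
                    heapq.heappush(frontier, (f, next(tie), new_g, nxt, proof + [step]))
    return None  # goal is not derivable from the given facts and rules


def goal_missing(state, goal):
    """Hypothetical admissible heuristic: at least one step remains until the goal holds."""
    return 0 if goal in state else 1


# Toy rule base (illustrative only); the last two rules are distractors.
RULES = [
    (frozenset({"rain"}), "wet_ground"),
    (frozenset({"wet_ground"}), "slippery"),
    (frozenset({"rain"}), "clouds"),
    (frozenset({"clouds"}), "no_sun"),
]

proof = a_star_proof({"rain"}, RULES, "slippery", goal_missing)
for premises, conclusion in proof:
    print(f"{' & '.join(premises)} -> {conclusion}")
```

On this toy input the search returns the two-step proof through `wet_ground` and leaves the distractor inferences out of the result. A trace of such a run (which states were expanded, and in what order) is the kind of signal the abstract describes using for supervised fine-tuning and process rewards, though how the paper encodes those traces is not specified here.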
Problem

Research questions and friction points this paper is trying to address.

deductive reasoning
large language models
reasoning efficiency
natural language inference
proof generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

A* search
reasoning efficiency
process reward modeling
supervised fine-tuning
deductive reasoning