Learning to Reason Efficiently with A* Post-Training

📅 2026-05-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large language models often produce incorrect or redundant steps in natural language deductive reasoning and struggle to balance correctness with efficiency. This work frames proof generation as a path-finding problem and uses A* search to guide post-training, combining supervised fine-tuning on A* execution traces with reinforcement learning driven by A*-informed process reward models. Llama-3.2 models in the 1B–3B parameter range go from near-zero accuracy to outperforming the much larger DeepSeek-V3.2. The analysis reports a trade-off: simple correctness rewards maximize accuracy, while A*-informed signals balance accuracy against efficiency. On larger search spaces, models trained with imperfect heuristics achieve higher accuracy.

📝 Abstract

Many applications of large language models (LLMs) require deductive reasoning, yet models frequently produce incorrect or redundant inference steps. We frame natural language inference as a search problem where the final answer is the valid proof itself, requiring a reasoning procedure in which intermediate inferences are correct. Specifically, we investigate whether LLMs can learn to generate correct and efficient proofs with guidance from A* search -- an algorithm that guarantees an optimally efficient path to a goal. We explore two training techniques: supervised fine-tuning on execution traces from A* and reinforcement learning with A*-informed process reward models. Empirically, we find that Llama-3.2 models in the 1B--3B range benefit substantially from A* post-training, going from near-zero accuracy to outperforming DeepSeek-V3.2 -- a much larger model. Our analysis uncovers a trade-off: while simple correctness rewards maximize accuracy, A*-informed signals strike a balance between accuracy and efficiency. Furthermore, we find that on larger search spaces, models trained with imperfect heuristics exhibit superior accuracy. Our results demonstrate a promising direction towards reasoning guided by principles derived from classical search algorithms.
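
To make the "proof as path-finding" framing concrete, the following is a minimal sketch of textbook A* over a toy forward-chaining problem. It is not the paper's implementation: the state representation (a set of derived facts), the unit step cost, the rule format, and the names `astar_proof` and `goal_heuristic` are all assumptions chosen for illustration. A* expands the state with the lowest f = g + h, where g is the number of inference steps taken so far and h is a heuristic estimate of the steps remaining.

```python
import heapq
from itertools import count


def astar_proof(facts, rules, goal, heuristic):
    """Return a shortest list of rule applications deriving `goal`, or None.

    Hypothetical setup: a state is the frozenset of facts derived so far,
    each rule is (frozenset_of_premises, conclusion), every step costs 1.
    """
    start = frozenset(facts)
    tie = count()  # tie-breaker so the heap never has to compare states
    frontier = [(heuristic(start, goal), next(tie), 0, start, [])]
    best_g = {start: 0}  # cheapest known cost to reach each state

    while frontier:
        _, _, g, state, path = heapq.heappop(frontier)
        if goal in state:
            return path
        if g > best_g.get(state, float("inf")):
            continue  # stale queue entry; a cheaper route was found later
        for premises, conclusion in rules:
            # A rule applies only if its premises hold and it adds a new fact.
            if conclusion in state or not premises <= state:
                continue
            nxt = state | {conclusion}
            if g + 1 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g + 1
                step = (tuple(sorted(premises)), conclusion)
                f = g + 1 + heuristic(nxt, goal)
                heapq.heappush(frontier, (f, next(tie), g + 1, nxt, path + [step]))
    return None


def goal_heuristic(state, goal):
    """Assumed toy heuristic: at least one step remains until the goal is derived.

    It never overestimates, so the returned proof has minimal length.
    """
    return 0 if goal in state else 1


facts = {"A", "B"}
rules = [
    (frozenset({"A"}), "C"),
    (frozenset({"B"}), "D"),        # distractor: not needed for the goal
    (frozenset({"C"}), "E"),
    (frozenset({"A", "B"}), "F"),   # distractor: not needed for the goal
    (frozenset({"C", "E"}), "G"),
]

proof = astar_proof(facts, rules, "G", goal_heuristic)
for premises, conclusion in proof:
    print(" & ".join(premises), "=>", conclusion)
```

On this toy problem the search returns the three-step proof (C, then E, then G) and leaves out the two distractor inferences, which is the kind of correct and non-redundant trace the abstract describes using as a training signal. How the paper encodes states, costs, and heuristics for natural language proofs is not specified in this summary.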

Problem

Research questions and friction points this paper is trying to address.

deductive reasoning

large language models

reasoning efficiency

natural language inference

proof generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

A* search

reasoning efficiency

process reward modeling