Learning to Reason in LLMs by Expectation Maximization

📅 2025-12-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the inconsistency between rationale generation and answer prediction in large language model (LLM) reasoning. The authors propose a unified optimization framework based on latent-variable modeling and the expectation-maximization (EM) algorithm. The key contribution is a formal integration of EM into LLM reasoning training, coupled with prompt posterior sampling (PPS): a strategy that retains only the rationalization stage of STaR and uses the sampling distribution to guide the model toward generating high-quality, answer-supporting rationales, without requiring answer-level supervision during rationale sampling. Compared to self-teaching methods such as STaR, PPS is architecturally simpler yet achieves significant improvements in reasoning accuracy on ARC, MMLU, and OpenBookQA. Empirical analysis confirms that the design of the sampling distribution is decisive for reasoning capability, supporting a controllable, rationale-guided approach to LLM inference.

📝 Abstract
Large language models (LLMs) solve reasoning problems by first generating a rationale and then answering. We formalize reasoning as a latent variable model and derive an expectation-maximization (EM) objective for learning to reason. This view connects EM and modern reward-based optimization, and shows that the main challenge lies in designing a sampling distribution that generates rationales that justify correct answers. We instantiate and compare several sampling schemes: rejection sampling with a budget, self-taught reasoner (STaR), and prompt posterior sampling (PPS), which only keeps the rationalization stage of STaR. Our experiments on the ARC, MMLU, and OpenBookQA datasets with the Llama and Qwen models show that the sampling scheme can significantly affect the accuracy of learned reasoning models. Despite its simplicity, we observe that PPS outperforms the other sampling schemes.
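The latent-variable view described in the abstract can be sketched as follows; the notation here is illustrative and not taken from the paper. With question $x$, latent rationale $z$, and answer $y$, the marginal likelihood and the EM-style lower bound are:

```latex
% Marginal likelihood over latent rationales z, and the evidence lower
% bound (ELBO) that an EM procedure alternately tightens and maximizes.
\log p_\theta(y \mid x)
  = \log \sum_{z} p_\theta(z \mid x)\, p_\theta(y \mid x, z)
  \;\ge\; \mathbb{E}_{q(z \mid x, y)}\!\left[
      \log \frac{p_\theta(z \mid x)\, p_\theta(y \mid x, z)}{q(z \mid x, y)}
    \right]
```

Under this reading, the E-step chooses the sampling distribution $q(z \mid x, y)$, which is where the paper's schemes (rejection sampling with a budget, STaR, PPS) differ, and the M-step fine-tunes $\theta$ on the sampled rationales. The bound is tight when $q$ matches the posterior $p_\theta(z \mid x, y)$, which is why the abstract frames the main challenge as designing a sampling distribution whose rationales justify correct answers.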
Problem

Research questions and friction points this paper is trying to address.

Learning reasoning in LLMs via expectation maximization
Comparing sampling schemes for rationale generation
Evaluating accuracy on ARC, MMLU, OpenBookQA datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Formalizes reasoning as latent variable model
Derives EM objective for learning to reason
Compares sampling schemes like PPS and STaR
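The budgeted rejection-sampling scheme named above can be sketched in a few lines. This is a minimal illustration, not the paper's code: `sample_fn` is a hypothetical stand-in for an LLM call that returns a (rationale, predicted answer) pair, and the M-step (fine-tuning on the accepted rationales) is left as a placeholder.

```python
def rejection_sample_rationales(question, answer, sample_fn, budget):
    """E-step sketch: draw up to `budget` rationales for a question and
    keep only those whose predicted answer matches the reference answer
    (rejection sampling with a budget)."""
    kept = []
    for _ in range(budget):
        rationale, predicted = sample_fn(question)
        if predicted == answer:  # accept only answer-justifying rationales
            kept.append((question, rationale, answer))
    return kept

def em_iteration(dataset, sample_fn, budget=4):
    """One EM iteration: collect accepted rationales across the dataset
    (E-step); an M-step would then fine-tune the model on `sft_data`."""
    sft_data = []
    for question, answer in dataset:
        sft_data.extend(
            rejection_sample_rationales(question, answer, sample_fn, budget)
        )
    return sft_data
```

STaR and PPS fit the same template with a different acceptance rule: STaR adds a hint-conditioned rationalization pass for questions the model gets wrong, while PPS (per the abstract) keeps only that rationalization stage.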