The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning

📅 2025-05-21
🤖 AI Summary
This work addresses the challenge of enhancing large language models' (LLMs) complex reasoning capabilities, particularly in mathematics, physics, and programming, without supervision. The authors propose a pure entropy minimization paradigm with a dual-path unsupervised optimization framework: a *training phase* (EM-FT/EM-RL) leveraging token-level entropy minimization and reinforcement learning with negative entropy as the sole reward, and an *inference phase* (EM-INF) employing logit adjustment that requires neither training data nor parameter updates. Crucially, they provide the first systematic demonstration that output entropy minimization alone suffices to unlock latent reasoning competence in pretrained LLMs. Experiments show that Qwen-7B under EM-RL matches or outperforms GRPO and RLOO baselines trained on 60K labeled examples, and that Qwen-32B with EM-INF matches or exceeds GPT-4o on the challenging SciCode benchmark while being roughly three times more efficient at inference than self-consistency and sequential refinement.

📝 Abstract
Entropy minimization (EM) trains the model to concentrate even more probability mass on its most confident outputs. We show that this simple objective alone, without any labeled data, can substantially improve large language models' (LLMs) performance on challenging math, physics, and coding tasks. We explore three approaches: (1) EM-FT minimizes token-level entropy similarly to instruction finetuning, but on unlabeled outputs drawn from the model; (2) EM-RL: reinforcement learning with negative entropy as the only reward to maximize; (3) EM-INF: inference-time logit adjustment to reduce entropy without any training data or parameter updates. On Qwen-7B, EM-RL, without any labeled data, achieves comparable or better performance than strong RL baselines such as GRPO and RLOO that are trained on 60K labeled examples. Furthermore, EM-INF enables Qwen-32B to match or exceed the performance of proprietary models like GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro on the challenging SciCode benchmark, while being 3x more efficient than self-consistency and sequential refinement. Our findings reveal that many pretrained LLMs possess previously underappreciated reasoning capabilities that can be effectively elicited through entropy minimization alone, without any labeled data or even any parameter updates.
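The objective behind EM-FT is simply the Shannon entropy of the model's next-token distribution, averaged over tokens of its own unlabeled samples; minimizing it concentrates probability mass on the model's most confident outputs. A minimal plain-Python sketch of that quantity (illustrative function names, not the paper's implementation):

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over one token's logits."""
    m = max(logits)                               # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

def em_ft_loss(logits_per_token):
    """EM-FT-style objective: mean token-level entropy over the model's own
    (unlabeled) sampled outputs. Gradient descent on this loss sharpens the
    model's distribution toward its confident predictions."""
    entropies = [token_entropy(l) for l in logits_per_token]
    return sum(entropies) / len(entropies)

# A flat distribution has maximal entropy (uniform over 4 tokens: ln 4 ≈ 1.386 nats);
# a peaked one has near-zero entropy, so minimizing the loss rewards confidence.
flat = [0.0, 0.0, 0.0, 0.0]
peaked = [8.0, 0.0, 0.0, 0.0]
```

In the actual method this loss would be backpropagated through the model's parameters, like instruction finetuning but with the model's own outputs as data.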
Problem

Research questions and friction points this paper is trying to address.

Improves LLM reasoning via entropy minimization without labeled data
Enhances performance on math, physics, and coding tasks using three EM approaches
Enables smaller models to match proprietary models' performance efficiently
Innovation

Methods, ideas, or system contributions that make the work stand out.

Entropy minimization trains models without labeled data
Three approaches: EM-FT, EM-RL, EM-INF
EM-INF lets Qwen-32B match GPT-4o on SciCode at 3x the efficiency of self-consistency
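The inference-time variant, EM-INF, adjusts logits to reduce entropy without any parameter updates. As a rough illustration of that idea (not the paper's exact procedure), one can take a few gradient steps on the logits themselves to minimize the entropy of the resulting softmax, using the closed-form gradient dH/dl_i = -p_i(log p_i + H):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def sharpen_logits(logits, step=0.5, iters=20):
    """Illustrative entropy-reducing logit adjustment: gradient descent on the
    logits to minimize H(softmax(logits)). The gradient of the entropy H with
    respect to logit i is -p_i * (log p_i + H), so descending it pushes mass
    toward the already-dominant tokens without touching model weights."""
    l = list(logits)
    for _ in range(iters):
        p = softmax(l)
        h = entropy(p)
        grad = [-pi * (math.log(pi) + h) if pi > 0 else 0.0 for pi in p]
        l = [li - step * gi for li, gi in zip(l, grad)]
    return l
```

Applied per decoding step, such an adjustment lowers the entropy of the sampling distribution while preserving the top-ranked token, which is why it needs no training data.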