🤖 AI Summary
Current methods for enhancing the reasoning capabilities of large language models (LLMs) rely heavily on human-annotated reasoning chains, ground-truth answers, or pretrained reward models, limiting scalability and generalizability.
Method: This paper introduces Entropy Minimized Policy Optimization (EMPO), an early attempt at a fully unsupervised framework for incentivizing reasoning that requires no external supervision. Operating in a latent semantic space, EMPO continuously minimizes the model's predictive entropy on raw, unlabeled user queries, using this signal alone to drive self-supervised policy updates.
Contribution/Results: It shows that LLM reasoning can improve continuously under pure self-supervision. Experiments report a +17.4 percentage-point gain in mathematical reasoning accuracy with Qwen2.5-Math-7B Base (30.7% → 48.1%) and a +10.09-point improvement in truthfulness accuracy on TruthfulQA with Qwen2.5-7B Instruct (87.16% → 97.25%).
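To make the mechanism concrete, below is a minimal, hedged sketch of how an entropy-minimization reward could be computed from unlabeled queries. It assumes responses are clustered by exact answer-string match and that each sample is rewarded with its cluster's empirical frequency; the paper itself clusters in a latent semantic space and couples this signal to a policy update, both of which are simplified away here. The function name `semantic_entropy_rewards` is illustrative, not from the paper.

```python
# Hedged sketch of an EMPO-style semantic-entropy signal (not the paper's code).
# Assumption: answers extracted from G sampled completions are clustered by
# exact string match; the paper's clustering uses a latent semantic space.
import math
from collections import Counter

def semantic_entropy_rewards(answers: list[str]) -> tuple[float, list[float]]:
    """Given G sampled answers to one unlabeled query, return the empirical
    semantic entropy and a per-sample reward (its cluster's frequency).

    Minimizing entropy pushes probability mass onto one answer cluster, so
    each sample is rewarded by how dominant its cluster already is.
    """
    counts = Counter(answers)                 # cluster by answer string
    total = len(answers)
    probs = {a: c / total for a, c in counts.items()}
    entropy = -sum(p * math.log(p) for p in probs.values())
    rewards = [probs[a] for a in answers]     # reward = cluster probability
    return entropy, rewards

# Example: 5 sampled answers to the same math query.
entropy, rewards = semantic_entropy_rewards(["42", "42", "42", "41", "7"])
print(f"entropy={entropy:.3f}, rewards={rewards}")
# entropy≈0.950; majority-cluster samples get reward 0.6, outliers 0.2.
```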
📝 Abstract
While large language models (LLMs) have demonstrated exceptional capabilities in challenging tasks such as mathematical reasoning, existing methods to enhance reasoning ability predominantly rely on supervised fine-tuning (SFT) followed by reinforcement learning (RL) on reasoning-specific data after pre-training. However, these approaches critically depend on external supervision, such as human-labelled reasoning traces, verified golden answers, or pre-trained reward models, which limits their scalability and practical applicability. In this work, we propose Entropy Minimized Policy Optimization (EMPO), which makes an early attempt at fully unsupervised LLM reasoning incentivization. EMPO requires no supervised information for incentivizing reasoning capabilities (i.e., no verifiable reasoning traces, no problems with golden answers, and no additional pre-trained reward models). By continuously minimizing the predictive entropy of LLMs on unlabeled user queries in a latent semantic space, EMPO enables purely self-supervised evolution of reasoning capabilities with strong flexibility and practicality. Our experiments demonstrate competitive performance of EMPO on both mathematical reasoning and free-form commonsense reasoning tasks. Specifically, without any supervised signals, EMPO boosts the accuracy of Qwen2.5-Math-7B Base from 30.7% to 48.1% on mathematical benchmarks and improves the truthfulness accuracy of Qwen2.5-7B Instruct from 87.16% to 97.25% on TruthfulQA.
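For readers who want the objective spelled out, the following is one plausible formalization of the entropy the abstract describes, written under the assumption that G responses are sampled per query x and grouped into semantically equivalent clusters c; the paper's exact notation may differ.

```latex
% Empirical probability of semantic cluster c among G sampled responses
% y_1, ..., y_G to an unlabeled query x, and the resulting semantic entropy.
\[
  p(c \mid x) = \frac{1}{G}\sum_{i=1}^{G} \mathbb{1}\left[y_i \in c\right],
  \qquad
  \mathcal{H}(x) = -\sum_{c \in \mathcal{C}(x)} p(c \mid x)\,\log p(c \mid x).
\]
% Training then minimizes the expectation of H(x) over unlabeled user queries.
```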