One-shot Entropy Minimization

📅 2025-05-26

📈 Citations: 0

✨ Influential: 0

career value

194K/year

🤖 AI Summary

This work challenges the prevailing paradigm of large language model (LLM) post-training, which relies heavily on large-scale labeled datasets and handcrafted reward functions. We propose an entropy minimization method requiring only **a single unlabeled demonstration and 10 gradient steps**, enabling lightweight, fully unsupervised fine-tuning—without human feedback or explicit reward modeling. Our approach significantly improves output consistency and task performance across diverse benchmarks. We systematically evaluate it on 15 open and proprietary LLMs spanning 13B to 440B parameters, demonstrating competitive or superior performance to conventional RLHF and rule-based reinforcement learning on benchmarks including AlpacaEval and MT-Bench. To our knowledge, this is the first empirical demonstration that single-sample entropy minimization achieves strong generalization and optimization efficiency. The method establishes a minimalist, scalable alternative for LLM post-training. Code is publicly available.

Technology Category

Application Category

📝 Abstract

We trained 13,440 large language models and found that entropy minimization requires only a single unlabeled data and 10 steps optimization to achieve performance improvements comparable to or even greater than those obtained using thousands of data and carefully designed rewards in rule-based reinforcement learning. This striking result may prompt a rethinking of post-training paradigms for large language models. Our code is avaliable at https://github.com/zitian-gao/one-shot-em.

Problem

Research questions and friction points this paper is trying to address.

Minimizing entropy with one unlabeled data

Achieving performance gains in 10 optimization steps

Challenging rule-based reinforcement learning paradigms

Innovation

Methods, ideas, or system contributions that make the work stand out.

Single unlabeled data for entropy minimization

10 steps optimization for performance improvements

Outperforms rule-based reinforcement learning

🔎 Similar Papers

Noise Contrastive Estimation-based Matching Framework for Low-Resource Security Attack Pattern Recognition