One-shot Entropy Minimization

๐Ÿ“… 2025-05-26
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work challenges the prevailing paradigm of large language model (LLM) post-training, which relies on large-scale labeled datasets and carefully engineered reward functions. The authors propose an entropy minimization method requiring only **a single unlabeled example and 10 gradient steps**, enabling lightweight, fully unsupervised fine-tuning without labels or explicit reward modeling. Across 13,440 trained LLMs, the method achieves performance improvements comparable to or even greater than those obtained with thousands of examples and handcrafted rewards in rule-based reinforcement learning. This establishes a minimalist, scalable alternative for LLM post-training. Code is publicly available.

๐Ÿ“ Abstract
We trained 13,440 large language models and found that entropy minimization requires only a single unlabeled example and 10 optimization steps to achieve performance improvements comparable to or even greater than those obtained using thousands of examples and carefully designed rewards in rule-based reinforcement learning. This striking result may prompt a rethinking of post-training paradigms for large language models. Our code is available at https://github.com/zitian-gao/one-shot-em.
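The core recipe the abstract describes can be sketched in a few lines: take one unlabeled prompt, and for about 10 gradient steps minimize the token-level entropy of the model's own output distribution. Below is a minimal, illustrative sketch of that idea; the toy model, optimizer, and hyperparameters are stand-ins for exposition and are not the paper's actual setup.

```python
# Hedged sketch of one-shot entropy minimization: fine-tune on a single
# unlabeled sequence for 10 gradient steps, with the loss being the mean
# Shannon entropy of the model's next-token distribution. The tiny model
# here is an illustrative stand-in for an LLM.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, dim = 50, 16

# Toy stand-in for a language model: embedding layer + linear vocab head.
model = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

# One unlabeled "example": a single token sequence, no labels, no reward.
prompt = torch.randint(0, vocab, (1, 8))

def token_entropy(logits):
    # Mean entropy H(p) = -sum_v p(v) log p(v), averaged over positions.
    logp = torch.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1).mean()

h0 = token_entropy(model(prompt)).item()
for _ in range(10):  # the "10 optimization steps" from the abstract
    loss = token_entropy(model(prompt))
    opt.zero_grad()
    loss.backward()
    opt.step()
h1 = token_entropy(model(prompt)).item()
# After training, h1 < h0: the model is more confident on its own outputs.
```

The unsupervised character of the method is visible in the loss: it depends only on the model's logits over the unlabeled prompt, with no reference answer or reward signal anywhere in the loop.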
Problem

Research questions and friction points this paper is trying to address.

Minimizing entropy with a single unlabeled example
Achieving performance gains within 10 optimization steps
Challenging rule-based reinforcement learning paradigms
Innovation

Methods, ideas, or system contributions that make the work stand out.

A single unlabeled example suffices for entropy minimization
10 optimization steps yield performance improvements
Matches or outperforms rule-based reinforcement learning
๐Ÿ”Ž Similar Papers
No similar papers found.
Zitian Gao (Ubiquant): language model reasoning, mechanistic interpretability
Lynx Chen (Ubiquant)
Joey Zhou (Ubiquant)
Bryan Dai (Ubiquant)