๐ค AI Summary
This work challenges the prevailing paradigm of large language model (LLM) post-training, which relies heavily on large-scale labeled datasets and handcrafted reward functions. We propose an entropy minimization method requiring only **a single unlabeled demonstration and 10 gradient steps**, enabling lightweight, fully unsupervised fine-tuningโwithout human feedback or explicit reward modeling. Our approach significantly improves output consistency and task performance across diverse benchmarks. We systematically evaluate it on 15 open and proprietary LLMs spanning 13B to 440B parameters, demonstrating competitive or superior performance to conventional RLHF and rule-based reinforcement learning on benchmarks including AlpacaEval and MT-Bench. To our knowledge, this is the first empirical demonstration that single-sample entropy minimization achieves strong generalization and optimization efficiency. The method establishes a minimalist, scalable alternative for LLM post-training. Code is publicly available.
๐ Abstract
We trained 13,440 large language models and found that entropy minimization requires only a single unlabeled data and 10 steps optimization to achieve performance improvements comparable to or even greater than those obtained using thousands of data and carefully designed rewards in rule-based reinforcement learning. This striking result may prompt a rethinking of post-training paradigms for large language models. Our code is avaliable at https://github.com/zitian-gao/one-shot-em.