🤖 AI Summary
Supervised fine-tuning (SFT) with the cross-entropy (CE) loss often suppresses output diversity in large language models (LLMs), hindering sampling-based exploration and the retention of pretraining knowledge. To address this, we formulate SFT as a two-player game between the model and an auxiliary distribution, and theoretically establish its equivalence to reverse KL divergence minimization with entropy regularization. Based on this insight, we propose GEM, an efficient training algorithm that explicitly controls the output distribution while matching CE's computational cost. Extensive evaluation across 3B–70B models shows that GEM matches CE's downstream task performance while significantly improving output diversity, enhances chat and code generation under test-time compute scaling, and mitigates forgetting of pretraining knowledge. Our core contributions are: (i) a game-theoretic framework for diversity preservation in SFT, and (ii) a scalable, unified optimization mechanism combining reverse KL divergence minimization and entropy regularization.
📝 Abstract
Large Language Models (LLMs) typically rely on Supervised Fine-Tuning (SFT) to specialize in downstream tasks, with the Cross Entropy (CE) loss being the de facto choice. However, CE maximizes the likelihood of observed data without accounting for alternative possibilities. As such, CE usually leads to reduced diversity in the model's outputs, which hinders further development that requires sampling to explore better responses. To address this limitation, this paper introduces a new game-theoretic formulation for SFT. In this framework, an auxiliary variable is introduced to regulate the learning process. We prove that the proposed game-theoretic approach connects to the problem of reverse KL minimization with entropy regularization. This regularization prevents over-memorization of training data and promotes output diversity. To implement this framework, we develop GEM, a new training algorithm that is as computationally efficient as CE by leveraging some unique properties of LLMs. Empirical studies of pre-trained models from 3B to 70B parameters show that GEM achieves comparable downstream performance to CE while significantly enhancing output diversity. This increased diversity translates to performance gains in test-time compute scaling for chat and code generation tasks. Moreover, we observe that preserving output diversity has the added benefit of mitigating forgetting, as maintaining diverse outputs encourages models to retain pre-trained knowledge throughout the training process.
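To make the contrast concrete, the sketch below compares plain CE with a generic entropy-regularized objective of the form L = CE − λ·H(p_θ). This is not the paper's GEM algorithm (whose exact objective and efficiency tricks are not reproduced here); the function names, the toy logits, and the coefficient `lam` are illustrative assumptions only.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def ce_loss(logits, target):
    # Standard cross-entropy: -log p(target).
    # Maximizes likelihood of the observed token only.
    return -math.log(softmax(logits)[target])

def entropy(probs):
    # Shannon entropy of the model's output distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_regularized_loss(logits, target, lam=0.1):
    # Hypothetical entropy-regularized objective (illustrative, not GEM):
    # subtracting lam * H(p) penalizes collapsing the distribution
    # onto the training token, preserving some output diversity.
    probs = softmax(logits)
    return -math.log(probs[target]) - lam * entropy(probs)

# A peaked distribution fits the target token more tightly but earns
# a smaller entropy bonus than a softer one.
peaked = [5.0, 0.0, 0.0, 0.0]
soft = [2.0, 1.0, 1.0, 1.0]
```

Under pure CE, the peaked logits are strictly preferred; the entropy term partially offsets that pressure, which is the intuition behind why diversity-preserving regularization mitigates over-memorization.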