Your Self-Play Algorithm is Secretly an Adversarial Imitator: Understanding LLM Self-Play through the Lens of Imitation Learning

📅 2026-02-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of theoretical grounding and suboptimal stability and performance in existing self-play fine-tuning methods for large language models. It introduces a novel perspective by formulating self-play as a minimax game between the language model and a parameterized implicit reward player, thereby unifying self-play imitation learning and preference alignment within a single adversarial imitation learning framework. The key innovation lies in a variational objective based on χ² divergence with bounded implicit rewards, which offers both theoretical interpretability and enhanced training stability, accompanied by a rigorous convergence analysis. Experimental results across multiple language model fine-tuning benchmarks consistently demonstrate the superiority of the proposed method over current self-play algorithms, validating the efficacy of the theoretical framework and the new algorithm.

📝 Abstract
Self-play post-training has emerged as an effective approach for finetuning large language models, turning a weak language model into a strong one without preference data. However, the theoretical foundations of self-play finetuning remain underexplored. In this work, we address this gap by connecting self-play finetuning with adversarial imitation learning, formulating the finetuning procedure as a min-max game between the model and a regularized implicit reward player parameterized by the model itself. This perspective unifies self-play imitation and general preference alignment within a common framework. Under this formulation, we present a game-theoretic analysis showing that self-play finetuning converges to its equilibrium. Guided by this theoretical formulation, we propose a new self-play imitation finetuning algorithm based on a $\chi^2$-divergence variational objective with bounded rewards and improved stability. Experiments on a variety of language model finetuning tasks demonstrate consistent improvements over existing self-play methods and validate our theoretical insights.
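The χ²-divergence variational objective mentioned in the abstract can be sketched in a few lines. This is a minimal illustration under standard assumptions, not the paper's implementation: the DPO-style implicit-reward parameterization (reward defined via model and reference log-probabilities with a hypothetical coefficient `beta`) and the conjugate penalty `f*(u) = u + u²/4` come from textbook f-divergence duality for the Pearson χ² divergence, `f(t) = (t − 1)²`; how the paper actually parameterizes and bounds the reward is not confirmed by this page.

```python
import math

def implicit_reward(logp_model, logp_ref, beta=0.1):
    # Implicit reward parameterized by the model itself (DPO-style assumption):
    # r(x, y) = beta * (log pi_theta(y|x) - log pi_ref(y|x)).
    # `beta` is a hypothetical scaling coefficient, not from the paper.
    return beta * (logp_model - logp_ref)

def chi2_variational_loss(rewards_data, rewards_model):
    # Variational lower bound of the Pearson chi^2 divergence:
    # for f(t) = (t - 1)^2 the convex conjugate is f*(u) = u + u^2 / 4, so
    #   D_chi2(P_data || P_model) >= E_data[r] - E_model[r + r^2 / 4].
    # The quadratic penalty on model samples discourages unbounded
    # implicit rewards, which is the source of the claimed stability.
    e_data = sum(rewards_data) / len(rewards_data)
    e_model = sum(r + r * r / 4.0 for r in rewards_model) / len(rewards_model)
    # The reward player maximizes the bound; returning its negation gives
    # a loss to minimize with a standard optimizer.
    return -(e_data - e_model)

# Toy usage with scalar log-probabilities for one data and one model sample:
r_data = implicit_reward(logp_model=-1.0, logp_ref=-2.0)   # 0.1
r_model = implicit_reward(logp_model=-3.0, logp_ref=-3.0)  # 0.0
loss = chi2_variational_loss([r_data], [r_model])
```

In the min-max reading of the abstract, the reward player ascends this bound while the language model descends it; the sketch only shows the inner (reward) objective.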
Problem

Research questions and friction points this paper is trying to address.

self-play
imitation learning
large language models
theoretical foundations
preference alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-play
adversarial imitation learning
min-max game
χ²-divergence
preference alignment