🤖 AI Summary
This work addresses two limitations of existing self-training methods for fine-tuning large language models: they are highly sensitive to synthetic data quality, and the margin between positive and negative samples shrinks during iterative optimization. The authors propose the TPAW algorithm, which operates in a fully self-supervised setting: the current policy model and its historical checkpoints form a team whose members both cooperate and compete in self-play. TPAW incorporates a dual adaptive weighting mechanism (response reweighting and dynamic player weighting) to improve training stability and alignment. Initialized from a supervised fine-tuned (SFT) model and requiring no additional human supervision, TPAW iteratively refines the model and consistently outperforms existing baselines across multiple base models and LLM benchmarks.
📝 Abstract
While recent self-training approaches have reduced reliance on human-labeled data for aligning LLMs, they still face two critical limitations: (i) sensitivity to synthetic data quality, leading to instability and bias amplification in iterative training; and (ii) ineffective optimization due to a diminishing gap between positive and negative responses over successive training iterations. In this paper, we propose Team-based self-Play with dual Adaptive Weighting (TPAW), a novel self-play algorithm designed to improve alignment in a fully self-supervised setting. TPAW adopts a team-based framework in which the current policy model both collaborates with and competes against historical checkpoints, promoting more stable and efficient optimization. To further enhance learning, we design two adaptive weighting mechanisms: (i) a response reweighting scheme that adjusts the importance of target responses, and (ii) a player weighting strategy that dynamically modulates each team member's contribution during training. Initialized from an SFT model, TPAW iteratively refines alignment without requiring additional human supervision. Experimental results demonstrate that TPAW consistently outperforms existing baselines across various base models and LLM benchmarks. Our code is publicly available at https://github.com/lab-klc/TPAW.
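The abstract does not state the TPAW objective, so the sketch below is only one plausible reading of "team-based self-play with dual adaptive weighting", not the authors' formulation. It assumes a DPO-style logistic loss in which a target response is contrasted with responses generated by several historical checkpoints. Every name here (`team_self_play_loss`, `player_scores`, `response_weight`, `beta`) and the choice of a softmax over per-player scores are hypothetical and introduced purely for illustration.

```python
import math


def softmax(scores, temperature=1.0):
    """Turn arbitrary per-player scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def logistic_loss(margin):
    """Numerically stable -log(sigmoid(margin)) = log(1 + exp(-margin))."""
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def team_self_play_loss(target_logratio, player_logratios, player_scores,
                        response_weight=1.0, beta=0.1):
    """Hypothetical weighted self-play loss for a single prompt.

    target_logratio:  log pi_theta(y|x) - log pi_ref(y|x) for the target response y
    player_logratios: the same log-ratio for each response generated by a
                      historical checkpoint (one entry per team member)
    player_scores:    assumed per-player scores, mapped to weights via softmax
    response_weight:  assumed scalar that rescales the target response's term
    """
    weights = softmax(player_scores)
    loss = 0.0
    for w, neg_logratio in zip(weights, player_logratios):
        margin = beta * (response_weight * target_logratio - neg_logratio)
        loss += w * logistic_loss(margin)
    return loss
```

Under these assumptions, raising a player's score shifts more of the loss onto the pair built from that checkpoint's response, and `response_weight` rescales how strongly the target response contributes to each margin. How TPAW actually computes either weight is specified in the paper and the linked repository, not here.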