🤖 AI Summary
To address two challenges in superalignment of large models, namely weak human supervision and incomplete elicitation of a strong model's capabilities, this paper proposes WeakS-to-Strong. It simulates the variability of human opinions with an ensemble of weak models, estimates confidence scores for the weak supervision using a Bayesian approach, and extends the weak-to-strong framework from text classification to text generation, where more advanced supervision strategies are investigated. Beyond the basic teacher-forcing framework, direct preference optimization (DPO) is applied to advance the strong student model's preference learning. Experiments show consistent gains in the reliability of the strong student model on both classification and generation tasks, indicating potential for superalignment.
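The ensemble step can be pictured as confidence-weighted fusion of the weak models' soft labels. The sketch below is illustrative, not from the paper: `combine_weak_labels` and the toy arrays are hypothetical, and the per-model `confidences` stand in for scores a Bayesian estimator would provide.

```python
import numpy as np

def combine_weak_labels(weak_probs, confidences):
    """Fuse soft labels from several weak supervisors (hypothetical helper).

    weak_probs:  shape (n_models, n_examples, n_classes), each weak
                 model's predicted class probabilities.
    confidences: shape (n_models,), one confidence score per weak model
                 (e.g. derived from a Bayesian posterior estimate).
    Returns confidence-weighted pseudo-labels, shape (n_examples, n_classes).
    """
    w = np.asarray(confidences, dtype=float)
    w = w / w.sum()                              # normalize model weights
    fused = np.einsum('m,mec->ec', w, weak_probs)
    return fused / fused.sum(axis=1, keepdims=True)

# Toy example: three weak models, two examples, two classes.
probs = np.array([
    [[0.9, 0.1], [0.4, 0.6]],
    [[0.7, 0.3], [0.5, 0.5]],
    [[0.2, 0.8], [0.6, 0.4]],
])
conf = [0.5, 0.3, 0.2]   # more confident models contribute more
pseudo = combine_weak_labels(probs, conf)
```

The fused `pseudo` labels would then supervise the strong student in place of a single weak model's outputs.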
📝 Abstract
Advances in large language models raise the question of how alignment techniques will adapt as models become increasingly complex and humans can only supervise them weakly. Weak-to-Strong mimics such a scenario, where weak model supervision attempts to harness the full capabilities of a much stronger model. This work extends Weak-to-Strong to WeakS-to-Strong by exploring an ensemble of weak models that simulates the variability in human opinions. Confidence scores are estimated using a Bayesian approach to guide the WeakS-to-Strong generalization. Furthermore, we extend the application of WeakS-to-Strong from text classification to text generation tasks, where more advanced strategies for supervision are investigated. Moreover, direct preference optimization is applied to advance the student model's preference learning beyond the basic teacher-forcing framework. Results demonstrate the effectiveness of the proposed approach in improving the reliability of a strong student model, showing potential for superalignment.
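The preference-learning component builds on the standard DPO objective (Rafailov et al.): the student is trained so that, relative to a frozen reference model, it assigns higher likelihood to the preferred response. A minimal scalar sketch, assuming per-response log-probabilities are already available (the function name and inputs are illustrative, not the paper's implementation):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Scalar form of the DPO objective for one preference pair."""
    # Implicit reward margin between chosen and rejected responses,
    # measured relative to the frozen reference model.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Standard DPO loss: -log(sigmoid(margin)).
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When student and reference agree, the margin is zero and loss = ln 2;
# the loss shrinks as the student favors the chosen response more strongly.
baseline = dpo_loss(-5.0, -6.0, -5.0, -6.0)
improved = dpo_loss(-4.0, -6.0, -5.0, -6.0)
```

In practice the log-probabilities would come from summing token log-likelihoods under the student and reference models, and the loss would be averaged over a batch of preference pairs.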