Bayesian WeakS-to-Strong from Text Classification to Generation

📅 2024-05-24
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
To address the challenges in large-model super-alignment—namely, sparse human supervision and insufficient elicitation of strong model capabilities—this paper proposes WeakS-to-Strong, a novel paradigm. It simulates human opinion diversity via ensemble-based weak models, quantifies supervision uncertainty using Bayesian confidence estimation, and extends the weak-to-strong framework to text generation for the first time. Crucially, for generation it goes beyond basic teacher forcing by applying direct preference optimization (DPO) to the student model's preference learning under weak supervision. The method incorporates adaptive supervision policy learning and uncertainty-aware knowledge transfer. Experiments demonstrate substantial improvements in the reliability and preference consistency of strong student models across both classification and generation tasks. This framework establishes a scalable pathway toward AGI alignment under weak supervision.
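The core classification-side idea above—an ensemble of weak models whose labels are combined according to estimated confidence—can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name is hypothetical, and the per-model confidence weights stand in for whatever the paper's Bayesian estimator would produce.

```python
import numpy as np

def aggregate_weak_labels(probs, confidences):
    """Combine soft labels from an ensemble of weak models.

    probs: shape (n_models, n_classes) -- each weak model's predicted
           class distribution for one example.
    confidences: shape (n_models,) -- per-model confidence weights
           (in the paper, these would come from a Bayesian estimate).
    Returns a confidence-weighted class distribution for supervision.
    """
    probs = np.asarray(probs, dtype=float)
    w = np.asarray(confidences, dtype=float)
    w = w / w.sum()                       # normalize model weights
    combined = (w[:, None] * probs).sum(axis=0)
    return combined / combined.sum()      # renormalize the soft label

# Three weak models disagree on a binary label; the more confident
# models pull the combined soft label toward class 1.
soft = aggregate_weak_labels(
    [[0.9, 0.1], [0.3, 0.7], [0.2, 0.8]],
    [0.2, 0.5, 0.3],
)
```

The strong student would then be trained against `soft` rather than any single weak model's (possibly wrong) hard label, which is what lets the ensemble simulate a diversity of human opinions.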

📝 Abstract
Advances in large language models raise the question of how alignment techniques will adapt as models become increasingly complex and humans will only be able to supervise them weakly. Weak-to-Strong mimics such a scenario where weak model supervision attempts to harness the full capabilities of a much stronger model. This work extends Weak-to-Strong to WeakS-to-Strong by exploring an ensemble of weak models which simulate the variability in human opinions. Confidence scores are estimated using a Bayesian approach to guide the WeakS-to-Strong generalization. Furthermore, we extend the application of WeakS-to-Strong from text classification tasks to text generation tasks where more advanced strategies are investigated for supervision. Moreover, direct preference optimization is applied to advance the student model's preference learning, beyond the basic learning framework of teacher forcing. Results demonstrate the effectiveness of the proposed approach for the reliability of a strong student model, showing potential for superalignment.
Problem

Research questions and friction points this paper is trying to address.

Adapt alignment techniques for increasingly complex language models.
Extend Weak-to-Strong to WeakS-to-Strong using ensemble weak models.
Apply WeakS-to-Strong from text classification to text generation tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ensemble weak models simulate human opinion variability
Bayesian approach estimates confidence scores for generalization
Extends WeakS-to-Strong to text generation tasks
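For the generation side, the paper applies direct preference optimization beyond teacher forcing. The standard DPO objective (Rafailov et al., 2023) for one preference pair can be written as a minimal sketch; the paper's actual training setup (batching, how weak supervision yields the preference pairs, choice of beta) may differ.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair.

    logp_*: student (policy) log-probabilities of the preferred and
    dispreferred responses; ref_logp_*: the same quantities under a
    frozen reference model. beta scales the implicit reward margin.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid

# If the student prefers the chosen response more strongly than the
# reference model does, the margin is positive and the loss falls
# below log(2) (the value at zero margin).
loss = dpo_loss(-10.0, -12.0, -11.0, -11.0)
```

Unlike teacher forcing, this loss needs only a ranking between two candidate outputs, which is a weaker (and cheaper) form of supervision than a full gold reference sequence.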
Ziyun Cui
Tsinghua University
Ziyang Zhang
Department of Electronic Engineering, Tsinghua University, Beijing, China
Wen Wu
Department of Engineering, University of Cambridge, Trumpington St., Cambridge, UK
Guangzhi Sun
University of Cambridge
Speech and language technology; conversational AI
Chao Zhang
Department of Electronic Engineering, Tsinghua University, Beijing, China