🤖 AI Summary
To address two challenges in superalignment of large models, namely weak human supervision and incomplete elicitation of a strong model's capabilities, this paper proposes WeakS-to-Strong. It simulates the variability of human opinions with an ensemble of weak models, estimates confidence scores for the weak supervision using a Bayesian approach, and extends the weak-to-strong framework from text classification to text generation, where more advanced supervision strategies are investigated. Beyond the basic teacher-forcing framework, direct preference optimization (DPO) is applied to advance the strong student model's preference learning. Experiments show consistent gains in the reliability of the strong student model on both classification and generation tasks, indicating potential for superalignment.
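The ensemble step can be pictured as confidence-weighted fusion of the weak models' soft labels. The sketch below is illustrative, not from the paper: `combine_weak_labels` and the toy arrays are hypothetical, and the per-model `confidences` stand in for scores a Bayesian estimator would provide.

```python
import numpy as np

def combine_weak_labels(weak_probs, confidences):
    """Fuse soft labels from several weak supervisors (hypothetical helper).

    weak_probs:  shape (n_models, n_examples, n_classes), each weak
                 model's predicted class probabilities.
    confidences: shape (n_models,), one confidence score per weak model
                 (e.g. derived from a Bayesian posterior estimate).
    Returns confidence-weighted pseudo-labels, shape (n_examples, n_classes).
    """
    w = np.asarray(confidences, dtype=float)
    w = w / w.sum()                              # normalize model weights
    fused = np.einsum('m,mec->ec', w, weak_probs)
    return fused / fused.sum(axis=1, keepdims=True)

# Toy example: three weak models, two examples, two classes.
probs = np.array([
    [[0.9, 0.1], [0.4, 0.6]],
    [[0.7, 0.3], [0.5, 0.5]],
    [[0.2, 0.8], [0.6, 0.4]],
])
conf = [0.5, 0.3, 0.2]   # more confident models contribute more
pseudo = combine_weak_labels(probs, conf)
```

The fused `pseudo` labels would then supervise the strong student in place of a single weak model's outputs.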
📝 Abstract
Advances in large language models raise the question of how alignment techniques will adapt as models become increasingly complex and humans can only supervise them weakly. Weak-to-Strong mimics such a scenario, where weak model supervision attempts to harness the full capabilities of a much stronger model. This work extends Weak-to-Strong to WeakS-to-Strong by exploring an ensemble of weak models that simulates the variability in human opinions. Confidence scores are estimated using a Bayesian approach to guide the WeakS-to-Strong generalization. Furthermore, we extend the application of WeakS-to-Strong from text classification to text generation tasks, where more advanced strategies for supervision are investigated. Moreover, direct preference optimization is applied to advance the student model's preference learning beyond the basic teacher-forcing framework. Results demonstrate the effectiveness of the proposed approach in improving the reliability of a strong student model, showing potential for superalignment.
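The preference-learning component builds on the standard DPO objective (Rafailov et al.): the student is trained so that, relative to a frozen reference model, it assigns higher likelihood to the preferred response. A minimal scalar sketch, assuming per-response log-probabilities are already available (the function name and inputs are illustrative, not the paper's implementation):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Scalar form of the DPO objective for one preference pair."""
    # Implicit reward margin between chosen and rejected responses,
    # measured relative to the frozen reference model.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Standard DPO loss: -log(sigmoid(margin)).
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When student and reference agree, the margin is zero and loss = ln 2;
# the loss shrinks as the student favors the chosen response more strongly.
baseline = dpo_loss(-5.0, -6.0, -5.0, -6.0)
improved = dpo_loss(-4.0, -6.0, -5.0, -6.0)
```

In practice the log-probabilities would come from summing token log-likelihoods under the student and reference models, and the loss would be averaged over a batch of preference pairs.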