A transfer learning framework for weak-to-strong generalization

📅 2024-05-25

📈 Citations: 2

✨ Influential: 0

🤖 AI Summary

This work studies weak-to-strong generalization: whether a stronger, possibly superhuman large language model (LLM) can be aligned using feedback from a weaker supervisor, such as human annotators or a less capable model, without degrading its capabilities. The authors cast the problem as transfer learning, in which a latent concept prior is transferred from the weak model to a strong pre-trained model, and prove that weak-to-strong generalization is possible by eliciting latent knowledge from the pre-trained LLM. They further prove that naive fine-tuning on weak feedback suffers from fundamental limitations, whereas a refinement-based approach suggested by the problem structure provably overcomes them. Experiments on multiple LLM alignment tasks demonstrate the practical applicability of the refinement approach.

📝 Abstract

Modern large language model (LLM) alignment techniques rely on human feedback, but it is unclear whether these techniques fundamentally limit the capabilities of aligned LLMs. In particular, it is unknown if it is possible to align (stronger) LLMs with superhuman capabilities with (weaker) human feedback without degrading their capabilities. This is an instance of the weak-to-strong generalization problem: using feedback from a weaker (less capable) model to train a stronger (more capable) model. We prove that weak-to-strong generalization is possible by eliciting latent knowledge from pre-trained LLMs. In particular, we cast the weak-to-strong generalization problem as a transfer learning problem in which we wish to transfer a latent concept prior from a weak model to a strong pre-trained model. We prove that a naive fine-tuning approach suffers from fundamental limitations, but an alternative refinement-based approach suggested by the problem structure provably overcomes the limitations of fine-tuning. Finally, we demonstrate the practical applicability of the refinement approach in multiple LLM alignment tasks.
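The contrast between fine-tuning and refinement can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the concept set `CONCEPTS`, the `weak_label` supervisor and its error pattern, and the use of simple label agreement to pick a concept are all hypothetical stand-ins. The "strong model" is represented as a set of latent concepts it already knows, naive fine-tuning as imitation of the weak labels, and refinement as using the weak labels only to identify the intended concept before relabeling with the strong model's own knowledge of it.

```python
from typing import Callable, Dict, List

Concept = Callable[[int], int]

# Hypothetical latent concepts the "strong pre-trained model" already knows.
CONCEPTS: Dict[str, Concept] = {
    "parity": lambda x: x % 2,
    "threshold": lambda x: int(x >= 120),
    "multiple_of_3": lambda x: int(x % 3 == 0),
}


def weak_label(x: int) -> int:
    """Hypothetical weak supervisor: targets parity but errs when x % 4 == 3."""
    truth = x % 2
    return 1 - truth if x % 4 == 3 else truth


def accuracy(pred: List[int], truth: List[int]) -> float:
    """Fraction of positions where the two label lists agree."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


inputs = list(range(240))
truth = [x % 2 for x in inputs]
weak = [weak_label(x) for x in inputs]

# Stand-in for naive fine-tuning: the strong model imitates the weak labels,
# so it inherits every mistake the weak supervisor makes.
naive_pred = list(weak)

# Stand-in for refinement: use the weak labels only to infer which latent
# concept is intended, then relabel with the strong model's own version of it.
agreement = {
    name: accuracy([concept(x) for x in inputs], weak)
    for name, concept in CONCEPTS.items()
}
best_concept = max(agreement, key=agreement.get)
refined_pred = [CONCEPTS[best_concept](x) for x in inputs]

naive_acc = accuracy(naive_pred, truth)
refined_acc = accuracy(refined_pred, truth)
print(f"naive: {naive_acc:.2f}, refined: {refined_acc:.2f}, concept: {best_concept}")
# prints "naive: 0.75, refined: 1.00, concept: parity"
```

In this toy setting the weak labels are wrong on a quarter of the inputs, yet they still agree with the parity concept more than with any other candidate, so the refined labels recover the target exactly while the imitating model stays at the weak supervisor's accuracy.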

Problem

Research questions and friction points this paper is trying to address.

Aligning strong LLMs with weak human feedback without degrading capabilities.

Overcoming limitations of fine-tuning in weak-to-strong generalization.

Transferring a latent concept prior from a weak model to a strong pre-trained model.

Innovation

Methods, ideas, or system contributions that make the work stand out.

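Casts weak-to-strong generalization as a transfer learning problem: transferring a latent concept prior from a weak model to a strong pre-trained model.

Proves that weak-to-strong generalization is possible by eliciting latent knowledge from pre-trained LLMs.

Proposes a refinement-based approach that provably overcomes the limitations of naive fine-tuning.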
