Representations Shape Weak-to-Strong Generalization: Theoretical Insights and Empirical Predictions

📅 2025-02-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates the Weak-to-Strong Generalization (W2SG) phenomenon—where weak models supervise strong models—to uncover the intrinsic mechanisms of cross-capability knowledge transfer. We propose a geometric characterization framework grounded in the kernel space derived from principal components of neural representations, enabling the first *label-free* theoretical prediction of W2SG performance. We prove that knowledge learnable by strong but not weak models is precisely captured by this kernel space, and we quantify both the supervision limitations of weak models and the inherent error-correction capacity of strong models. Integrating kernel methods, PCA, and neural representation geometry, our approach is validated across five NLP tasks (involving 52 LLMs) and a molecular property prediction task. Empirical results show that our representation-based metric accurately predicts W2SG performance trends without access to ground-truth labels, establishing a novel, interpretable, and predictive paradigm for weakly supervised learning.

📝 Abstract
Weak-to-Strong Generalization (W2SG), where a weak model supervises a stronger one, serves as an important analogy for understanding how humans might guide superhuman intelligence in the future. Promising empirical results have shown that a strong model can surpass its weak supervisor. While recent work has offered theoretical insights into this phenomenon, a clear understanding of the interactions between weak and strong models that drive W2SG remains elusive. We investigate W2SG through a theoretical lens and show that it can be characterized using kernels derived from the principal components of weak and strong models' internal representations. These kernels can be used to define a space that, at a high level, captures what the weak model is unable to learn but is learnable by the strong model. The projection of labels onto this space quantifies how much the strong model falls short of its full potential due to weak supervision. This characterization also provides insights into how certain errors in weak supervision can be corrected by the strong model, regardless of overfitting. Our theory has significant practical implications, providing a representation-based metric that predicts W2SG performance trends without requiring labels, as shown in experiments on molecular prediction with transformers and five NLP tasks involving 52 LLMs.
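The abstract's central construction — kernels built from the principal components of each model's internal representations, and a residual space capturing what the strong model can represent but the weak model cannot — can be sketched roughly as follows. This is a minimal illustration of the idea, not the paper's exact formula: the function names, the choice of top-k components, and the Frobenius-norm residual ratio are all assumptions for exposition.

```python
import numpy as np

def pca_kernel(reps: np.ndarray, k: int) -> np.ndarray:
    """Projection kernel onto the span of the top-k principal
    components of a model's internal representations.

    reps: (n_samples, d) matrix of representations.
    Returns the (n, n) kernel K = U_k U_k^T in sample space.
    """
    X = reps - reps.mean(axis=0)                 # center features
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    U_k = U[:, :k]                               # top-k left singular vectors
    return U_k @ U_k.T                           # orthogonal projection, K^2 = K

def w2sg_gap_metric(weak_reps: np.ndarray, strong_reps: np.ndarray,
                    k_weak: int, k_strong: int) -> float:
    """Label-free proxy (illustrative): the fraction of the strong
    model's principal subspace that lies outside the weak model's."""
    Kw = pca_kernel(weak_reps, k_weak)
    Ks = pca_kernel(strong_reps, k_strong)
    n = Kw.shape[0]
    # Project the strong kernel onto the orthogonal complement of the
    # weak model's subspace; a larger residual suggests more knowledge
    # the strong model can represent but weak supervision cannot convey.
    residual = (np.eye(n) - Kw) @ Ks
    return float(np.linalg.norm(residual, "fro") / np.linalg.norm(Ks, "fro"))
```

Because both kernels are orthogonal projections, the metric lies in [0, 1]: it is 0 when the strong model's principal subspace is contained in the weak model's, and approaches 1 when the two subspaces are nearly orthogonal.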
Problem

Research questions and friction points this paper is trying to address.

Weak-to-Strong Generalization
Enhancing learning under weak supervision
Weak–strong model interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

W2SG Phenomenon
Kernel-based characterization
Label-free predictive framework