On Weak-to-Strong Generalization and f-Divergence

📅 2025-06-03
🤖 AI Summary
Weak-to-strong transfer, in which knowledge from weakly supervised models is transferred to strong pre-trained models, suffers from high computational overhead and reliance on auxiliary weak models. Method: We propose an information-theoretic loss framework based on *f*-divergence that reformulates the transfer objective through loss-function reconstruction, eliminating the need for multi-model collaboration or complex distillation pipelines. Contribution/Results: This work is the first to systematically characterize the theoretical limitations and equivalences of *f*-divergences in weak-to-strong generalization, derive sample complexity bounds, and prove a unified improvement in both noise robustness and generalization performance. Extensive experiments across multiple benchmarks show that the approach achieves superior generalization under noisy labels, without requiring any auxiliary weak model, while significantly reducing memory footprint and computational cost.

📝 Abstract
Weak-to-strong generalization (W2SG) has emerged as a promising paradigm for stimulating the capabilities of strong pre-trained models by leveraging supervision from weaker supervisors. To improve the performance of the strong model, existing methods often require additional weak models or complex procedures, leading to substantial computational and memory overhead. Motivated by the effectiveness of $f$-divergence loss in various machine learning domains, we introduce $f$-divergence as an information-theoretic loss function framework in W2SG. Our theoretical analysis reveals fundamental limitations and equivalence of different $f$-divergence losses in W2SG, supported by sample complexity bounds and information-theoretic insights. We empirically demonstrate that $f$-divergence loss, which generalizes widely-used metrics like KL divergence, effectively improves generalization and noise tolerance of the strong model in practice.
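To make the abstract's central object concrete, here is a minimal sketch of a discrete $f$-divergence, $D_f(P \| Q) = \sum_x Q(x)\, f\big(P(x)/Q(x)\big)$ for a convex generator $f$ with $f(1) = 0$. Different choices of $f$ recover the familiar losses the abstract alludes to (forward KL, reverse KL, total variation). This is an illustrative sketch only, not the paper's implementation; the function names and example distributions are hypothetical.

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(P || Q) = sum_x Q(x) * f(P(x) / Q(x)) for discrete distributions.

    Assumes p and q are probability vectors with q > 0 wherever p > 0.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(q * f(p / q)))

# Convex generators f with f(1) = 0, each recovering a common divergence.
kl_forward = lambda t: t * np.log(t)        # yields KL(P || Q)
kl_reverse = lambda t: -np.log(t)           # yields KL(Q || P)
total_var  = lambda t: 0.5 * np.abs(t - 1)  # yields total variation distance

# Hypothetical example: weak supervisor's soft labels P vs. strong model's
# predicted distribution Q over three classes.
p_weak   = np.array([0.7, 0.2, 0.1])
q_strong = np.array([0.6, 0.3, 0.1])

for name, f in [("KL(P||Q)", kl_forward),
                ("KL(Q||P)", kl_reverse),
                ("TV      ", total_var)]:
    print(name, round(f_divergence(p_weak, q_strong, f), 4))
```

In a W2SG training loop, a loss of this family would be evaluated between the weak model's soft labels and the strong model's predictions on each batch; the choice of generator $f$ is what the paper's theory compares.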
Problem

Research questions and friction points this paper is trying to address.

Improving weak-to-strong generalization with f-divergence
Reducing computational overhead in strong model training
Enhancing noise tolerance via information-theoretic loss functions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces f-divergence loss as a framework in W2SG
Improves strong model generalization and noise tolerance
Reduces computational and memory overhead
Wei Yao
Gaoling School of Artificial Intelligence, Renmin University of China
Gengze Xu
Gaoling School of Artificial Intelligence, Renmin University of China
Huayi Tang
Gaoling School of Artificial Intelligence, Renmin University of China
Wenkai Yang
Renmin University of China
Natural Language Processing · Machine Learning
Donglin Di
Li Auto Inc.
Generative Models · Embodied AI · Medical Image · Multimedia
Ziqiao Wang
Assistant Professor of Computer Science, Tongji University
Machine Learning · Statistical Learning Theory · Deep Learning · Information Theory
Yong Liu
Gaoling School of Artificial Intelligence, Renmin University of China