🤖 AI Summary
This paper investigates whether "weak-to-strong" (W2S) generalization holds, and why it fails, when a strong student model is fine-tuned on pseudolabels generated by a weak teacher for downstream tasks exhibiting spurious correlations. Theoretical analysis identifies a critical failure mechanism: a shift in the minority-group proportion between the teacher's labeled data and the pseudolabeled set. To address this, the authors propose a group-label-free, confidence-based subset retraining method that constructs a more robust training set by selecting the student's high-confidence pseudolabeled examples, mitigating subgroup imbalance without group annotations. Supported by asymptotic theoretical analysis and extensive experiments across diverse spurious-correlation benchmarks and teacher-student architectures, the method consistently improves W2S generalization and out-of-distribution robustness, establishing a trustworthy paradigm for knowledge distillation that requires no access to group labels.
📝 Abstract
We initiate a unified theoretical and algorithmic study of a key problem in weak-to-strong (W2S) generalization: when fine-tuning a strong pre-trained student with pseudolabels from a weaker teacher on a downstream task with spurious correlations, does W2S happen, and how can it be improved upon failure? We consider two sources of spurious correlations caused by group imbalance: (i) a weak teacher fine-tuned on group-imbalanced labeled data with a minority-group fraction $\eta_\ell$, and (ii) a group-imbalanced unlabeled set pseudolabeled by the teacher with a minority-group fraction $\eta_u$. Theoretically, a precise characterization of the W2S gain in the proportional asymptotic limit shows that W2S always happens with sufficiently many pseudolabels when $\eta_u = \eta_\ell$, but may fail when $\eta_u \neq \eta_\ell$, with the W2S gain diminishing as $(\eta_u - \eta_\ell)^2$ increases. Our theory is corroborated by extensive experiments on various spurious-correlation benchmarks and teacher-student pairs. To boost W2S performance upon failure, we further propose a simple and effective algorithmic remedy that retrains the strong student on its own high-confidence data subset after W2S fine-tuning. Our algorithm is group-label-free and achieves consistent, substantial improvements over vanilla W2S fine-tuning.
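The high-confidence subset selection at the heart of the proposed remedy can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `confident_subset`, the fixed threshold of 0.9, and the use of the maximum softmax probability as the confidence score are all assumptions made for this example.

```python
import numpy as np

def confident_subset(probs, threshold=0.9):
    """Select examples whose predicted-class confidence exceeds a
    threshold (hypothetical criterion: max softmax probability).

    probs: (n_examples, n_classes) array of the fine-tuned student's
           predicted class probabilities on the unlabeled set.
    Returns the boolean keep-mask and the student's own pseudolabels
    for the retained examples, which would then be used to retrain
    the student after W2S fine-tuning.
    """
    conf = probs.max(axis=1)            # confidence per example
    labels = probs.argmax(axis=1)       # student's own pseudolabels
    keep = conf >= threshold            # group-label-free filtering
    return keep, labels[keep]

# Toy example: 4 examples, 2 classes.
probs = np.array([[0.95, 0.05],
                  [0.55, 0.45],
                  [0.10, 0.90],
                  [0.60, 0.40]])
keep, subset_labels = confident_subset(probs)
# Only the first and third examples clear the 0.9 threshold.
```

Note that the selection uses only the student's own predictions, so no group annotations are needed at any point.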