Evaluating Risks in Weak-to-Strong Alignment: A Bias-Variance Perspective

📅 2026-04-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In weak-to-strong alignment, a strong model can become confidently wrong on samples that lie in the weak teacher's blind spots, causing alignment to fail. This work analyzes the problem through a bias-variance-covariance decomposition that connects misfit theory to practical post-training pipelines. The authors derive a misfit-based upper bound on the weak-to-strong population risk and use a "blind-spot deception" metric, which isolates cases where the strong model is confidently wrong while the weak model is uncertain. Empirical evaluation of four pipelines spanning SFT, RLHF, and RLAIF on the PKU-SafeRLHF and HH-RLHF datasets shows that strong-model variance is the strongest predictor of blind-spot deception across settings, with covariance providing additional but weaker signal. Blind-spot evaluation also helps distinguish failures inherited from weak supervision from failures that arise in regions where the weak model is uncertain.

📝 Abstract

Weak-to-strong alignment offers a promising route to scalable supervision, but it can fail when a strong model becomes confidently wrong on examples that lie in the weak teacher's blind spots. Understanding such failures requires going beyond aggregate accuracy, since weak-to-strong errors depend not only on whether the strong model disagrees with its teacher, but also on how confidence and uncertainty are distributed across examples. In this work, we analyze weak-to-strong alignment through a bias-variance-covariance lens that connects misfit theory to practical post-training pipelines. We derive a misfit-based upper bound on weak-to-strong population risk and study its empirical components using continuous confidence scores. We evaluate four weak-to-strong pipelines spanning supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and reinforcement learning from AI feedback (RLAIF) on the PKU-SafeRLHF and HH-RLHF datasets. Using a blind-spot deception metric that isolates cases where the strong model is confidently wrong while the weak model is uncertain, we find that strong-model variance is the strongest empirical predictor of deception across our settings. Covariance provides additional but weaker information, indicating that weak-strong dependence matters, but does not by itself explain the observed failures. These results suggest that strong-model variance can serve as an early-warning signal for weak-to-strong deception, while blind-spot evaluation helps distinguish whether failures are inherited from weak supervision or arise in regions of weak-model uncertainty.
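As a rough illustration of the quantities the abstract names, the sketch below computes a blind-spot deception rate and the variance and covariance of continuous confidence scores. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the binary-label setup, the thresholds `strong_tau` and `weak_band`, and the choice to take variance across examples are all hypothetical, and the paper may define each of these differently.

```python
from statistics import fmean


def blind_spot_deception_rate(strong_conf, weak_conf, labels,
                              strong_tau=0.9, weak_band=0.1):
    """Fraction of examples where the strong model is confidently wrong
    while the weak model is uncertain.

    Assumptions (not from the paper): confidences are probabilities of
    class 1, labels are 0/1, and both thresholds are illustrative.
    """
    flagged = 0
    for s, w, y in zip(strong_conf, weak_conf, labels):
        strong_pred = 1 if s >= 0.5 else 0
        # Confidence the strong model assigns to its own prediction.
        strong_certainty = max(s, 1.0 - s)
        confidently_wrong = strong_pred != y and strong_certainty >= strong_tau
        # The weak model counts as uncertain when it sits near 0.5.
        weak_uncertain = abs(w - 0.5) <= weak_band
        flagged += confidently_wrong and weak_uncertain
    return flagged / len(labels)


def confidence_moments(strong_conf, weak_conf):
    """Population variance of each model's confidence and their covariance,
    taken across examples (an assumed, simplified reading of the
    bias-variance-covariance components)."""
    n = len(strong_conf)
    s_mean, w_mean = fmean(strong_conf), fmean(weak_conf)
    strong_var = sum((s - s_mean) ** 2 for s in strong_conf) / n
    weak_var = sum((w - w_mean) ** 2 for w in weak_conf) / n
    cov = sum((s - s_mean) * (w - w_mean)
              for s, w in zip(strong_conf, weak_conf)) / n
    return strong_var, weak_var, cov


# Toy data: only the first example is a blind-spot deception case.
strong = [0.95, 0.20, 0.97, 0.60]
weak = [0.55, 0.10, 0.90, 0.45]
gold = [0, 0, 1, 1]
rate = blind_spot_deception_rate(strong, weak, gold)
strong_var, weak_var, cov = confidence_moments(strong, weak)
```

In this toy setup only the first example is flagged: the strong model predicts class 1 with confidence 0.95 against a gold label of 0, while the weak model sits at 0.55. Comparing `strong_var` against `rate` across several runs or pipelines would be one simple way to probe the early-warning relationship the abstract describes.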

Problem

Research questions and friction points this paper is trying to address.

weak-to-strong alignment

bias-variance

model uncertainty

blind spots

deception

Innovation

Methods, ideas, or system contributions that make the work stand out.

weak-to-strong alignment

bias-variance-covariance decomposition

misfit theory