🤖 AI Summary
Existing evaluation metrics struggle to detect “correct yet fragile” predictions: those that remain correct while their probability mass already shifts toward incorrect classes under perturbation. This work formalizes the phenomenon as margin-aware error flow and proposes FragileFlow, a plug-in regularizer that uses a calibrated margin buffer to identify fragile predictions and organizes their off-class probability mass into a class-wise vulnerable-risk matrix. Theoretically, it establishes the first PAC-Bayes upper bound for this error-flow object and shows how empirical spectral control gives a conservative route to worst-class robustness under a stability condition. Empirically, on multiple-choice large language model benchmarks and few-shot CLIP adaptation, FragileFlow consistently improves the proposed risk measures over matched baselines, raises perturbed worst-class accuracy in most settings, and preserves clean accuracy.
📝 Abstract
Robust adaptation of LLMs and VLMs is often evaluated by average accuracy or average consistency under perturbations. However, these averages can hide a structured failure mode: a prediction may remain correct while probability mass already flows from particular true classes toward systematic wrong competitors near the decision boundary. In this paper, we formalize this phenomenon as margin-aware error flow and introduce FragileFlow, a plug-in regularizer that uses a calibrated margin buffer to identify correct-but-fragile predictions and organize their off-class probability mass into a class-wise vulnerable-risk matrix. Theoretically, we provide the first PAC-Bayes upper bound for this margin-aware error-flow object, showing how empirical spectral control yields a conservative route to deterministic worst-class robustness under a stability condition. Experiments on multiple-choice LLM benchmarks and few-shot CLIP adaptation show that FragileFlow consistently improves the proposed theory-facing risk measures over matched baselines, yields perturbed worst-class accuracy gains in most settings, and preserves clean accuracy across comparisons.
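The abstract does not state the formulas, so the following is only a minimal sketch of how a margin buffer and a class-wise vulnerable-risk matrix could be computed. It rests on several assumptions: the margin is taken as the true-class probability minus the largest wrong-class probability, the buffer `tau` is a fixed hypothetical threshold rather than a calibrated one, each row is normalized by the number of samples in that true class, and the spectral norm of the matrix stands in for the quantity under "spectral control". The names `vulnerable_risk_matrix` and `spectral_norm` are illustrative and do not come from the paper.

```python
import math

def vulnerable_risk_matrix(probs, labels, num_classes, tau=0.2):
    """Assumed construction: entry [i][j] is the average probability mass
    that correct-but-fragile samples of true class i place on wrong class j."""
    risk = [[0.0] * num_classes for _ in range(num_classes)]
    counts = [0] * num_classes
    fragile = []
    for p, y in zip(probs, labels):
        counts[y] += 1
        best_wrong = max(p[j] for j in range(num_classes) if j != y)
        margin = p[y] - best_wrong
        # Correct (margin > 0) but inside the buffer (margin < tau).
        is_fragile = 0.0 < margin < tau
        fragile.append(is_fragile)
        if is_fragile:
            for j in range(num_classes):
                if j != y:
                    risk[y][j] += p[j]
    for i in range(num_classes):
        if counts[i]:
            risk[i] = [v / counts[i] for v in risk[i]]
    return risk, fragile

def spectral_norm(matrix, iters=200):
    """Largest singular value, via power iteration on M^T M."""
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(cols)] * cols  # positive start vector
    sigma = 0.0
    for _ in range(iters):
        u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        w = [sum(matrix[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0  # zero matrix: no fragile mass at all
        v = [x / norm for x in w]
        sigma = math.sqrt(norm)
    return sigma

# Toy predicted probabilities on perturbed inputs (made-up numbers).
probs = [
    [0.55, 0.40, 0.05],  # true class 0, correct, small margin -> fragile
    [0.90, 0.05, 0.05],  # true class 0, correct, large margin
    [0.30, 0.60, 0.10],  # true class 1, correct, margin outside the buffer
    [0.50, 0.30, 0.20],  # true class 2, already misclassified
]
labels = [0, 0, 1, 2]

risk, fragile = vulnerable_risk_matrix(probs, labels, num_classes=3, tau=0.2)
penalty = spectral_norm(risk)
```

In this toy case only the first sample is flagged, so all of the recorded mass sits in the row for true class 0, mostly in the column for class 1. A regularizer in this spirit would add a multiple of `penalty` to the training loss; how FragileFlow actually weights, calibrates, or differentiates through these quantities is not specified in the abstract.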