Architecture-agnostic Lipschitz-constant Bayesian header and its application to resolve semantically proximal classification errors with vision transformers

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses structured label noise arising from semantically similar classes in supervised deep learning, a setting in which conventional robust training methods often fail. The authors propose an architecture-agnostic Bayesian head with Lipschitz constraints that plugs into feature extractors such as Vision Transformers (ViTs). Applying spectral normalization to both the mean and the log-variance of the variational weights promotes calibrated predictive uncertainty and mitigates noise amplification. A novel metric jointly captures uncertainty and confidence across misclassification rates, and an adaptive arithmetic-mean fusion scheme combines feature-space proximity with predictive uncertainty to identify mislabeled samples. At a 15% semantic mislabeling rate, the method reaches a recall above 0.93, more than 7% higher than k-nearest-neighbor-based identification methods. It is also evaluated under structured (adversarial) and unstructured noise at inference time, where it is reported to remain robust.
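The spectral-normalization idea in the summary above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names (`spectral_norm`, `spectral_normalize`, `mc_predict`), the `bound` parameter, the Gaussian weight posterior, and the choice to rescale only when the norm exceeds the bound are all assumptions made for illustration.

```python
import math
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def spectral_norm(W, n_iter=100, seed=0):
    """Estimate the largest singular value of W by power iteration."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in W[0]]
    Wt = list(zip(*W))  # transpose of W
    sigma = 0.0
    for _ in range(n_iter):
        norm_v = math.sqrt(sum(a * a for a in v))
        if norm_v == 0.0:
            return 0.0  # W maps everything to zero
        v = [a / norm_v for a in v]
        u = matvec(W, v)                          # u = W v with ||v|| = 1
        sigma = math.sqrt(sum(a * a for a in u))  # current estimate ||W v||
        v = matvec(Wt, u)                         # next direction: W^T W v
    return sigma

def spectral_normalize(W, bound=1.0):
    """Rescale W so its spectral norm does not exceed `bound` (assumed rule)."""
    sigma = spectral_norm(W)
    scale = min(1.0, bound / sigma) if sigma > 0.0 else 1.0
    return [[w * scale for w in row] for row in W]

def softmax(z):
    m = max(z)
    e = [math.exp(a - m) for a in z]
    total = sum(e)
    return [a / total for a in e]

def mc_predict(x, mean, log_var, n_samples=64, bound=1.0, seed=0):
    """Average class probabilities over Monte Carlo weight samples.

    Both the mean and the log-variance matrices are spectrally normalized
    before sampling W = mu + exp(0.5 * log_var) * eps (assumed Gaussian form).
    """
    rng = random.Random(seed)
    mu = spectral_normalize(mean, bound)
    lv = spectral_normalize(log_var, bound)
    avg = [0.0] * len(mu)
    for _ in range(n_samples):
        W = [[m + math.exp(0.5 * s) * rng.gauss(0.0, 1.0) for m, s in zip(mr, sr)]
             for mr, sr in zip(mu, lv)]
        for c, p in enumerate(softmax(matvec(W, x))):
            avg[c] += p / n_samples
    return avg

# Toy head: 3 classes, 2 input features (values are made up).
probs = mc_predict([1.0, -0.5],
                   mean=[[2.0, 0.0], [0.0, 4.0], [1.0, 1.0]],
                   log_var=[[-6.0, -6.0], [-6.0, -6.0], [-6.0, -6.0]])
print(probs)
```

The repeated sampling loop is also where the extra computational cost of such a head comes from: each prediction requires `n_samples` forward passes through the head.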

📝 Abstract

Label noise remains a critical bottleneck for the generalization of supervised deep learning models, particularly when errors are structured rather than random. Standard robust training methods often fail in the presence of such semantically proximal classification errors. This work presents an architecture-agnostic Lipschitz-constant Bayesian header that can be integrated into feature extractors such as vision transformers, yielding the bi-Lipschitz-constrained Bayesian Vision Transformer (LipB-ViT). In contrast to conventional Bayesian layers, our approach enforces spectral normalization on both the mean and log-variance of the variational weights, which promotes calibrated predictive uncertainty and mitigates noise amplification. We further propose a novel metric to jointly capture uncertainty and confidence across misclassification rates, as well as an adaptive arithmetic-mean fusion scheme that combines feature-space proximity with predictive uncertainty to detect corrupted labels. This scheme outperforms state-of-the-art k-nearest-neighbor-based identification methods by more than 7%, reaching a recall of more than 0.93 at 15% semantically misclassified labels. Although computational costs increase due to Monte Carlo sampling, the method offers plug-and-play compatibility with pre-trained backbones and consistent hyperparameters across domains, suggesting strong utility for high-stakes applications with variable annotation reliability. The stabilized confidence estimates serve as the foundation for an analysis pipeline that jointly assesses dataset quality and label noise, yielding a second novel metric for their combined quantification. Lastly, we systematically evaluate LipB-ViT under both structured (adversarial) and unstructured noise at inference time and compare its performance against baseline methods, demonstrating its robustness in realistic high-noise and attack scenarios.
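As a rough illustration of fusing feature-space proximity with predictive uncertainty, the sketch below scores each sample by the arithmetic mean of two signals: the fraction of its k nearest neighbors that carry a different label, and the normalized entropy of its predicted class probabilities. The abstract does not spell out the adaptive weighting rule, so the fixed `weight`, the `threshold`, the entropy-based uncertainty, and all function names here are hypothetical stand-ins rather than the paper's method.

```python
import math

def knn_disagreement(features, labels, k=3):
    """Fraction of each sample's k nearest neighbors with a different label."""
    scores = []
    for i, fi in enumerate(features):
        dists = sorted((math.dist(fi, fj), j)
                       for j, fj in enumerate(features) if j != i)
        neighbors = [j for _, j in dists[:k]]
        scores.append(sum(labels[j] != labels[i] for j in neighbors) / len(neighbors))
    return scores

def normalized_entropy(probs):
    """Entropy of a probability vector, scaled to [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))

def fused_scores(features, labels, probs, k=3, weight=0.5):
    """Weighted arithmetic mean of proximity and uncertainty signals.

    `weight` is a fixed placeholder; the adaptive rule is not given in the text.
    """
    prox = knn_disagreement(features, labels, k)
    unc = [normalized_entropy(p) for p in probs]
    return [weight * a + (1.0 - weight) * b for a, b in zip(prox, unc)]

def flag_suspects(scores, threshold=0.5):
    """Indices whose fused score exceeds the (assumed) threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]

# Toy data: two clusters; sample 3 sits in the first cluster but is labeled 1.
features = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
            (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
labels = [0, 0, 0, 1, 1, 1, 1, 1]
# Made-up predictive probabilities: confident everywhere except sample 3.
probs = [[0.95, 0.05]] * 3 + [[0.5, 0.5]] + [[0.05, 0.95]] * 4

scores = fused_scores(features, labels, probs)
suspects = flag_suspects(scores)
print(suspects)
```

In this toy setup the planted error is the only sample where both signals are high at once; neighbors of that sample see a modest rise in disagreement but stay below the threshold because their predictions are confident.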

Problem

Research questions and friction points this paper is trying to address.

label noise

semantically proximal errors

generalization

structured noise

vision transformers

Innovation

Methods, ideas, or system contributions that make the work stand out.

Lipschitz-constrained Bayesian

vision transformers

semantic label noise