🤖 AI Summary
This work addresses the lack of reliable confidence estimation in deployed Vision Transformers (ViTs) by proposing a lightweight method that predicts classification errors from a single forward pass. The approach attaches lightweight linear heads to intermediate layers of a ViT and extracts trajectory features from the last L layers: the logits of the predicted class and its top-K competitors, plus statistics describing how unstable the top-ranked classes are across depth. A linear probe trained on these features predicts whether the classifier's output is wrong. The work is motivated by internal-signal hallucination detection in large language models and investigates whether similar depth-wise signals exist in ViTs. With minimal additional computation, the method matches or improves AUCPR over baselines across datasets and shows stronger cross-dataset generalization.
📝 Abstract
Reliable confidence estimation is critical when deploying vision models. We study error prediction: determining whether an image classifier's output is correct using only signals from a single forward pass. Motivated by internal-signal hallucination detection in large language models, we investigate whether similar depth-wise signals exist in Vision Transformers (ViTs). We propose a simple method that models how class evidence evolves across layers. By attaching lightweight linear heads to intermediate layers, we extract features from the last L layers that capture both the logits of the predicted class and its top-K competitors, as well as statistics describing instability of top-ranked classes across depth. A linear probe trained on these features predicts the error indicator. Across datasets, our method improves or matches AUCPR over baselines and shows stronger cross-dataset generalization while requiring minimal additional computation.
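The feature construction described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the specific instability statistics (top-1 flip rate and disagreement with the final prediction), the feature layout, and the logistic form of the probe are all assumptions chosen for illustration. It assumes the per-layer logits from the linear heads are already available as an array of shape (L, C), ordered from shallow to deep.

```python
import numpy as np


def trajectory_features(layer_logits: np.ndarray, top_k: int = 3) -> np.ndarray:
    """Build a feature vector from the logits of the last L layers.

    layer_logits: array of shape (L, C); row i holds the class logits from
    the (hypothetical) linear head on layer i, last row = final layer.
    """
    num_layers = layer_logits.shape[0]
    final = layer_logits[-1]

    # Rank classes by final-layer logit, highest first.
    order = np.argsort(-final)
    pred = order[0]
    competitors = order[1:1 + top_k]

    # Logit trajectories of the predicted class and its top-K competitors.
    pred_traj = layer_logits[:, pred]                  # shape (L,)
    comp_traj = layer_logits[:, competitors].ravel()   # shape (L * K,)

    # Assumed instability statistics over depth:
    top1 = layer_logits.argmax(axis=1)
    # fraction of consecutive layer pairs where the top-1 class changes
    flip_rate = np.count_nonzero(top1[1:] != top1[:-1]) / max(num_layers - 1, 1)
    # fraction of layers whose top-1 class differs from the final prediction
    disagreement = np.mean(top1 != pred)

    return np.concatenate([pred_traj, comp_traj, [flip_rate, disagreement]])


def error_probability(features: np.ndarray, weights: np.ndarray, bias: float) -> float:
    """Linear probe (assumed logistic) giving the probability of an error."""
    return float(1.0 / (1.0 + np.exp(-(features @ weights + bias))))


# Toy example: 3 layers, 4 classes; the top-1 class flips once across depth.
toy_logits = np.array([
    [2.0, 1.0, 0.0, -1.0],
    [0.5, 1.5, 0.0, -1.0],
    [0.0, 3.0, 1.0, -2.0],
])
toy_features = trajectory_features(toy_logits, top_k=2)
```

In practice the probe weights would be fit on held-out images labeled by whether the classifier was right or wrong; that training step is omitted here.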