🤖 AI Summary
Neural network failure detection remains unreliable in safety-critical applications—particularly due to overconfidence under misclassification and the lack of interpretability in existing logit-based confidence estimation methods.
Method: This paper proposes the first dual-objective framework grounded in human-understandable visual concepts. It models concept activation, introduces an ordinal ranking mechanism, and fuses multi-source signals to generate fine-grained, interpretable confidence scores—enabling transparent failure attribution without modifying the base model architecture.
Contribution/Results: The method achieves significant improvements via post-hoc processing: false positive rates decrease by 3.7% on ImageNet and 9.0% on EuroSAT. It simultaneously ensures high detection reliability, strong interpretability, and deployment efficiency—bridging critical gaps between robustness, transparency, and practicality in real-world vision systems.
📝 Abstract
Reliable failure detection is of paramount importance in safety-critical applications, yet neural networks are known to produce overconfident predictions for misclassified samples. Existing confidence score functions therefore remain problematic: they rely on a single category-level signal, the logits, to detect failures. This research introduces a strategy that leverages human-level concepts for a dual purpose: to reliably detect when a model fails and to transparently interpret why. By integrating a nuanced array of concept signals for each category, our method enables a finer-grained assessment of the model's confidence. We present a simple yet highly effective approach based on the ordinal ranking of concept activations for the input image. Without bells and whistles, our method significantly reduces the false positive rate across diverse real-world image classification benchmarks, specifically by 3.7% on ImageNet and 9.0% on EuroSAT.
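To make the ordinal-ranking idea concrete, the following is a minimal sketch, not the paper's actual implementation: given per-concept activation scores for an input and a (hypothetical) mapping from each class to its associated concepts, a confidence score can be derived from how highly the predicted class's concepts rank among all concept activations. The function and variable names (`concept_rank_confidence`, `class_concepts`) are illustrative assumptions, not from the paper.

```python
import numpy as np

def concept_rank_confidence(concept_scores, class_concepts, predicted_class):
    """Illustrative sketch: confidence from the ordinal rank of the
    predicted class's concepts among all concept activations.

    concept_scores: 1-D array of activation scores, one per concept.
    class_concepts: dict mapping class id -> list of concept indices
                    (an assumed structure, for illustration only).
    """
    # Convert raw activations to ordinal ranks (0 = lowest activation).
    order = np.argsort(concept_scores)
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(concept_scores))
    # Average rank of the predicted class's concepts, normalized to [0, 1].
    idx = class_concepts[predicted_class]
    return ranks[idx].mean() / (len(concept_scores) - 1)

# Toy example: 4 concepts; class 0 is tied to concepts {0, 2}, class 1 to {1, 3}.
scores = np.array([0.9, 0.1, 0.8, 0.2])
mapping = {0: [0, 2], 1: [1, 3]}
conf_0 = concept_rank_confidence(scores, mapping, predicted_class=0)
conf_1 = concept_rank_confidence(scores, mapping, predicted_class=1)
# Class 0's concepts dominate the ranking, so conf_0 > conf_1.
```

Because the score is built from ranks of named concepts rather than raw logits, a low value can be attributed to specific concepts that failed to activate, which is the kind of transparent failure attribution the abstract describes.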