AI Summary
To address the core challenge of simultaneously ensuring robustness and interpretability for AI models in high-stakes scenarios, this paper proposes CAVE, the first image classification framework integrating 3D-aware robust representation learning with concept-level interpretability. Methodologically, CAVE employs 3D neural voxel modeling to learn physically grounded semantic concepts; it then aligns voxel representations with human-understandable concepts via voxel-concept distillation and analyzes concept activation vectors (CAVs) to enable sample-consistent, visually verifiable, and semantically plausible concept-driven inference. Contributions include: (1) the first unification of 3D geometric robustness with concept-based interpretability; (2) overcoming key limitations of prior black-box concept methods in generalizability and trustworthiness; and (3) achieving state-of-the-art out-of-distribution robustness (on OOD detection and corruption benchmarks) while significantly outperforming existing approaches across multiple quantitative interpretability metrics.
Abstract
With the rise of neural networks, especially in high-stakes applications, these networks need two properties to ensure their safety: (i) robustness and (ii) interpretability. Recent advances in classifiers with 3D volumetric object representations have demonstrated greatly enhanced robustness on out-of-distribution data. However, these 3D-aware classifiers have not been studied from the perspective of interpretability. We introduce CAVE (Concept Aware Volumes for Explanations), a new direction that unifies interpretability and robustness in image classification. We design an inherently interpretable and robust classifier by extending existing 3D-aware classifiers with concepts extracted from their volumetric representations for classification. Across an array of quantitative interpretability metrics, we compare against different concept-based approaches from the explainable AI literature and show that CAVE discovers well-grounded concepts that are used consistently across images, while achieving superior robustness.