AI Summary
This work addresses a critical security vulnerability in Concept Bottleneck Models (CBMs), whose interpretability introduces new risks: their concept layers are susceptible to adversarial manipulation, leading to severe misclassification. We formalize, for the first time, the notion of “concept-level attacks” as a novel threat model and propose SPECTRA, a training mechanism based on semantic perturbation regularization to enhance the robustness of concept representations. A theoretical framework is developed to quantify stability in concept space, enabling the first effective defense against perturbations targeting the concept layer. Experimental results demonstrate that SPECTRA increases the minimum perturbation norm required for a successful attack from 0.46 to over 4,200, while incurring no more than a 2.2% drop in classification accuracy.
Abstract
Concept Bottleneck Models (CBMs) have emerged as a cornerstone approach for interpretable machine learning, providing human-understandable intermediate representations through explicit concept activations. However, this interpretability fundamentally introduces a critical, previously unexplored attack surface: the concept bottleneck layer itself. We present a comprehensive, systematic study of concept-level adversarial vulnerabilities in CBMs, revealing that targeted, minimal perturbations operating on input pixels can induce catastrophic misclassification by manipulating semantic representations. We develop a rigorous theoretical framework to quantify concept-space robustness, establishing novel metrics that expose the vulnerability landscape of these architectures. Our extensive analysis on the CUB-200-2011 dataset demonstrates that standard CBMs exhibit severe susceptibility to concept-level manipulation. To address this critical weakness, we introduce SPECTRA (Semantic Perturbation-based Concept Training for Robustness against Attacks), a principled stability regularization defense. SPECTRA effectively hardens the semantic representation space, increasing the minimal perturbation norm required for a successful attack from 0.46 to over 4,200, rendering targeted concept manipulation computationally prohibitive. Furthermore, SPECTRA preserves baseline classification accuracy to within 2.2%. By establishing concept-level attacks as a fundamentally distinct threat model, this work opens a new research frontier at the intersection of interpretable machine learning and adversarial robustness.
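The abstract describes SPECTRA as a stability regularization defense but does not give its loss function. As a rough illustration of the general idea only, the sketch below adds a generic stability penalty to a task loss: it measures how much a toy concept layer's activations change under small random input perturbations. Everything here is an assumption made for illustration, not the authors' method: the linear-plus-sigmoid concept layer, the random (rather than adversarial) perturbations, and the names `concepts`, `stability_penalty`, `total_loss`, `eps`, and `lam`.

```python
import math
import random


def concepts(W, x):
    """Toy concept layer (hypothetical): sigmoid of a linear map.

    W is a list of weight rows, one row per concept; x is the input vector.
    """
    return [
        1.0 / (1.0 + math.exp(-sum(w * xi for w, xi in zip(row, x))))
        for row in W
    ]


def stability_penalty(W, x, eps=0.1, n_samples=8, seed=0):
    """Mean squared change in concept activations under random input
    perturbations of L2 norm eps.

    This is a generic stability term, not SPECTRA's actual regularizer.
    """
    rng = random.Random(seed)
    clean = concepts(W, x)
    total = 0.0
    for _ in range(n_samples):
        # Draw a random direction and rescale it to have L2 norm eps.
        delta = [rng.gauss(0.0, 1.0) for _ in x]
        norm = math.sqrt(sum(d * d for d in delta))
        x_pert = [xi + eps * d / norm for xi, d in zip(x, delta)]
        pert = concepts(W, x_pert)
        total += sum((p - c) ** 2 for p, c in zip(pert, clean)) / len(clean)
    return total / n_samples


def total_loss(task_loss, W, x, lam=1.0, eps=0.1):
    """Task loss plus a weighted stability penalty (lam is an assumed weight)."""
    return task_loss + lam * stability_penalty(W, x, eps=eps)


if __name__ == "__main__":
    W = [[0.5, -0.3], [0.2, 0.4]]
    x = [0.0, 0.0]
    print(stability_penalty(W, x))
    print(total_loss(1.0, W, x, lam=10.0))
```

Minimizing a term of this kind pushes the concept layer toward activations that change little under small input changes, which is the intuition behind requiring a larger perturbation norm for a successful concept-level attack. A practical defense would more likely use worst-case perturbations than random ones.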