When Interpretability Becomes a Liability: Adversarial Attacks on CBM Concept Layers

📅 2026-05-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a security vulnerability in Concept Bottleneck Models (CBMs): the concept layer that makes them interpretable can be manipulated by small adversarial perturbations of the input pixels, leading to severe misclassification. We formalize, for the first time, the notion of “concept-level attacks” as a distinct threat model and propose SPECTRA, a training mechanism based on semantic perturbation regularization that makes concept representations more robust. We also develop a theoretical framework that quantifies stability in concept space and supports the first effective defense against perturbations targeting the concept layer. On CUB-200-2011, SPECTRA increases the minimum perturbation norm required for a successful attack from 0.46 to over 4,200, while incurring no more than a 2.2% drop in classification accuracy.
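
The headline metric here is the minimum perturbation norm required for a successful attack. As a rough illustration of what that quantity measures, the sketch below computes it in closed form for a toy linear concept encoder. This is an assumption made for illustration only: the linear encoder, the name `min_norm_concept_attack`, and the toy dimensions are hypothetical and are not the architecture or attack used in the paper.

```python
import numpy as np

def min_norm_concept_attack(W_c, x, target_logits):
    """Smallest L2 input perturbation that moves the concept logits of a
    toy linear encoder z = W_c @ x onto target_logits (illustrative only)."""
    residual = target_logits - W_c @ x
    # The pseudo-inverse yields the minimum-norm solution of W_c @ delta = residual.
    return np.linalg.pinv(W_c) @ residual

rng = np.random.default_rng(0)
W_c = rng.normal(size=(3, 8))   # 3 concepts, 8 input features (toy sizes)
x = rng.normal(size=8)

z = W_c @ x                     # clean concept logits
target = z.copy()
target[0] = -z[0]               # flip the sign of concept 0's logit only

delta = min_norm_concept_attack(W_c, x, target)
attack_norm = np.linalg.norm(delta)   # the "minimum perturbation norm"
z_adv = W_c @ (x + delta)             # concept logits after the attack
```

In this toy setting a larger `attack_norm` means the targeted concept is harder to flip, which is the sense in which the reported jump from 0.46 to over 4,200 indicates a hardened concept space. For a real, nonlinear CBM no closed form exists, and the norm would have to be estimated with an iterative attack.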

📝 Abstract

Concept Bottleneck Models (CBMs) have emerged as a cornerstone approach for interpretable machine learning, providing human-understandable intermediate representations through explicit concept activations. However, this interpretability fundamentally introduces a critical, previously unexplored attack surface: the concept bottleneck layer itself. We present a comprehensive, systematic study of concept-level adversarial vulnerabilities in CBMs, revealing that targeted, minimal perturbations operating on input pixels can induce catastrophic misclassification by manipulating semantic representations. We develop a rigorous theoretical framework to quantify concept-space robustness, establishing novel metrics that expose the vulnerability landscape of these architectures. Our extensive analysis on the CUB-200-2011 dataset demonstrates that standard CBMs exhibit severe susceptibility to concept-level manipulation. To address this critical weakness, we introduce SPECTRA (Semantic Perturbation-based Concept Training for Robustness against Attacks), a principled stability regularization defense. SPECTRA effectively hardens the semantic representation space, increasing the minimal perturbation norm required for a successful attack from 0.46 to over 4,200, rendering targeted concept manipulation computationally prohibitive. Furthermore, SPECTRA preserves baseline classification accuracy to within 2.2%. By establishing concept-level attacks as a fundamentally distinct threat model, this work opens a new research frontier at the intersection of interpretable machine learning and adversarial robustness.
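
The abstract describes SPECTRA as a stability regularization defense but does not give its loss. The sketch below shows one generic form such a regularizer could take: a penalty on how far concept activations move under small random input perturbations, added to the task loss. Everything in it is an assumption for illustration (the linear-sigmoid concept encoder, Gaussian noise as the perturbation, and the names `stability_penalty`, `lam`, and `sigma`); it is not the authors' method.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def concept_probs(W_c, x):
    """Toy concept encoder: x has shape (batch, d), W_c has shape (k, d)."""
    return sigmoid(x @ W_c.T)

def stability_penalty(W_c, x, sigma, rng, n_samples=16):
    """Monte Carlo estimate of E||g(x + eps) - g(x)||^2 with
    eps ~ N(0, sigma^2 I), averaged over the batch (hypothetical regularizer)."""
    clean = concept_probs(W_c, x)
    total = 0.0
    for _ in range(n_samples):
        noisy = concept_probs(W_c, x + sigma * rng.normal(size=x.shape))
        total += np.mean(np.sum((noisy - clean) ** 2, axis=-1))
    return total / n_samples

def regularized_loss(task_loss, W_c, x, lam, sigma, rng):
    """Task loss plus a weighted concept-stability term."""
    return task_loss + lam * stability_penalty(W_c, x, sigma, rng)

rng = np.random.default_rng(0)
W_c = rng.normal(size=(4, 10))      # 4 concepts, 10 input features (toy sizes)
x = rng.normal(size=(5, 10))        # batch of 5 inputs
loss = regularized_loss(task_loss=0.7, W_c=W_c, x=x, lam=0.1, sigma=0.05, rng=rng)
```

The weight `lam` controls the trade-off the abstract reports: a stronger penalty makes concept activations harder to move, at some cost in classification accuracy.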

Problem

Research questions and friction points this paper is trying to address.

Concept Bottleneck Models

adversarial attacks

interpretability

concept-level manipulation

semantic representations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Concept Bottleneck Models

adversarial attacks

concept-level manipulation