Learning Concept Bottleneck Models from Mechanistic Explanations

📅 2026-03-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a limitation of traditional Concept Bottleneck Models (CBMs): they rely on manually predefined concepts that often lack task relevance or learnability, which constrains model performance. To overcome this, the authors propose the Mechanistic Concept Bottleneck Model (M-CBM), which automatically extracts task-relevant, interpretable latent concepts directly from the internal representations of a black-box model using sparse autoencoders. The extracted concepts are then named and annotated by a multimodal large language model to construct an adaptive bottleneck layer. To evaluate information leakage and interpretability fairly, the study introduces the Number of Contributing Concepts (NCC), a decision-level sparsity metric. Experiments across multiple datasets show that M-CBM significantly outperforms existing CBM approaches at matched sparsity levels, achieving higher concept prediction accuracy while providing concise, human-interpretable decision rationales.
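The concept-extraction step described above can be illustrated with a minimal sparse-autoencoder forward pass: a ReLU encoder maps black-box hidden states to an overcomplete set of sparse activations, and a linear decoder reconstructs the input, with an L1 penalty encouraging sparsity. This is a generic sketch in NumPy; the paper's actual SAE architecture, dimensions, and training objective are not specified here and may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def sae_forward(h, W_enc, b_enc, W_dec, b_dec, l1_coef=1e-3):
    """One forward pass of a generic sparse autoencoder: ReLU encoder,
    linear decoder, MSE reconstruction loss plus an L1 sparsity penalty
    on the concept activations. (Illustrative only, not the paper's SAE.)"""
    z = relu(h @ W_enc + b_enc)            # sparse concept activations
    h_hat = z @ W_dec + b_dec              # reconstruction of the hidden state
    loss = np.mean((h_hat - h) ** 2) + l1_coef * np.mean(np.abs(z))
    return z, loss

# Toy usage: "h" stands in for backbone embeddings of a batch of images.
d_model, n_concepts = 64, 256              # hypothetical sizes
h = rng.standard_normal((8, d_model))
W_enc = rng.standard_normal((d_model, n_concepts)) * 0.1
b_enc = np.zeros(n_concepts)
W_dec = rng.standard_normal((n_concepts, d_model)) * 0.1
b_dec = np.zeros(d_model)

z, loss = sae_forward(h, W_enc, b_enc, W_dec, b_dec)
print(z.shape)  # (8, 256)
```

Each latent unit of `z` is then treated as a candidate concept; in M-CBM, a multimodal LLM names each one from the images that most activate it.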

📝 Abstract
Concept Bottleneck Models (CBMs) aim for ante-hoc interpretability by learning a bottleneck layer that predicts interpretable concepts before the decision. State-of-the-art approaches typically select which concepts to learn via human specification, open knowledge graphs, prompting an LLM, or using general CLIP concepts. However, concepts defined a priori may not have sufficient predictive power for the task or even be learnable from the available data. As a result, these CBMs often significantly trail their black-box counterparts when controlling for information leakage. To address this, we introduce a novel CBM pipeline named Mechanistic CBM (M-CBM), which builds the bottleneck directly from a black-box model's own learned concepts. These concepts are extracted via Sparse Autoencoders (SAEs) and subsequently named and annotated on a selected subset of images using a Multimodal LLM. For fair comparison and leakage control, we also introduce the Number of Contributing Concepts (NCC), a decision-level sparsity metric that extends the recently proposed NEC metric. Across diverse datasets, we show that M-CBMs consistently surpass prior CBMs at matched sparsity, while improving concept predictions and providing concise explanations. Our code is available at https://github.com/Antonio-Dee/M-CBM.
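The abstract describes NCC as a decision-level sparsity metric but does not give its formula. One plausible reading, sketched below, counts the concepts whose weighted contribution to the predicted class logit exceeds a small fraction of the total absolute contribution; the function name, threshold, and exact definition here are assumptions, not the paper's.

```python
import numpy as np

def number_of_contributing_concepts(c, w, threshold=0.01):
    """Hypothetical decision-level sparsity count: for one input, count the
    concepts whose contribution |c_i * w_i| to the predicted class logit
    exceeds `threshold` times the total absolute contribution.
    (The paper's exact NCC definition may differ.)"""
    contrib = np.abs(c * w)          # per-concept contribution to the logit
    total = contrib.sum()
    if total == 0:
        return 0
    return int((contrib / total > threshold).sum())

# Toy example: three of six concepts dominate the decision.
c = np.array([0.9, 0.0, 0.8, 0.01, 0.7, 0.0])   # concept activations
w = np.array([1.0, 2.0, 1.0, 0.1, 1.0, 0.0])    # final-layer weights, one class
print(number_of_contributing_concepts(c, w))    # 3
```

A decision-level count like this complements dataset-level sparsity measures: it asks how many concepts actually drive each individual prediction, which is what matters for the conciseness of an explanation.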
Problem

Research questions and friction points this paper is trying to address.

Concept Bottleneck Models
interpretability
predictive power
concept learnability
information leakage
Innovation

Methods, ideas, or system contributions that make the work stand out.

Concept Bottleneck Models
Sparse Autoencoders
Mechanistic Interpretability
Multimodal LLM
Sparsity Metric