Learning Concept Bottleneck Models from Mechanistic Explanations

📅 2026-03-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a limitation of traditional Concept Bottleneck Models (CBMs): they rely on manually predefined concepts that often lack task relevance or learnability, which constrains model performance. To overcome this, the authors propose the Mechanistic Concept Bottleneck Model (M-CBM), which automatically extracts task-relevant, interpretable latent concepts directly from the internal representations of a black-box model using sparse autoencoders. The extracted concepts are then named and annotated by a multimodal large language model to construct an adaptive bottleneck layer. To evaluate information leakage and interpretability fairly, the study introduces the Number of Contributing Concepts (NCC), a decision-level sparsity metric. Experiments across multiple datasets show that M-CBM significantly outperforms existing CBM approaches at matched sparsity levels, achieving higher concept prediction accuracy while providing concise, human-interpretable decision rationales.
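The concept-extraction step described above can be illustrated with a minimal sparse-autoencoder forward pass: a ReLU encoder maps black-box hidden states to an overcomplete set of sparse activations, and a linear decoder reconstructs the input, with an L1 penalty encouraging sparsity. This is a generic sketch in NumPy; the paper's actual SAE architecture, dimensions, and training objective are not specified here and may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def sae_forward(h, W_enc, b_enc, W_dec, b_dec, l1_coef=1e-3):
    """One forward pass of a generic sparse autoencoder: ReLU encoder,
    linear decoder, MSE reconstruction loss plus an L1 sparsity penalty
    on the concept activations. (Illustrative only, not the paper's SAE.)"""
    z = relu(h @ W_enc + b_enc)            # sparse concept activations
    h_hat = z @ W_dec + b_dec              # reconstruction of the hidden state
    loss = np.mean((h_hat - h) ** 2) + l1_coef * np.mean(np.abs(z))
    return z, loss

# Toy usage: "h" stands in for backbone embeddings of a batch of images.
d_model, n_concepts = 64, 256              # hypothetical sizes
h = rng.standard_normal((8, d_model))
W_enc = rng.standard_normal((d_model, n_concepts)) * 0.1
b_enc = np.zeros(n_concepts)
W_dec = rng.standard_normal((n_concepts, d_model)) * 0.1
b_dec = np.zeros(d_model)

z, loss = sae_forward(h, W_enc, b_enc, W_dec, b_dec)
print(z.shape)  # (8, 256)
```

Each latent unit of `z` is then treated as a candidate concept; in M-CBM, a multimodal LLM names each one from the images that most activate it.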

📝 Abstract
Concept Bottleneck Models (CBMs) aim for ante-hoc interpretability by learning a bottleneck layer that predicts interpretable concepts before the decision. State-of-the-art approaches typically select which concepts to learn via human specification, open knowledge graphs, prompting an LLM, or using general CLIP concepts. However, concepts defined a priori may not have sufficient predictive power for the task or even be learnable from the available data. As a result, these CBMs often significantly trail their black-box counterparts when controlling for information leakage. To address this, we introduce a novel CBM pipeline named Mechanistic CBM (M-CBM), which builds the bottleneck directly from a black-box model's own learned concepts. These concepts are extracted via Sparse Autoencoders (SAEs) and subsequently named and annotated on a selected subset of images using a Multimodal LLM. For fair comparison and leakage control, we also introduce the Number of Contributing Concepts (NCC), a decision-level sparsity metric that extends the recently proposed NEC metric. Across diverse datasets, we show that M-CBMs consistently surpass prior CBMs at matched sparsity, while improving concept predictions and providing concise explanations. Our code is available at https://github.com/Antonio-Dee/M-CBM.
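The abstract describes NCC as a decision-level sparsity metric but does not give its formula. One plausible reading, sketched below, counts the concepts whose weighted contribution to the predicted class logit exceeds a small fraction of the total absolute contribution; the function name, threshold, and exact definition here are assumptions, not the paper's.

```python
import numpy as np

def number_of_contributing_concepts(c, w, threshold=0.01):
    """Hypothetical decision-level sparsity count: for one input, count the
    concepts whose contribution |c_i * w_i| to the predicted class logit
    exceeds `threshold` times the total absolute contribution.
    (The paper's exact NCC definition may differ.)"""
    contrib = np.abs(c * w)          # per-concept contribution to the logit
    total = contrib.sum()
    if total == 0:
        return 0
    return int((contrib / total > threshold).sum())

# Toy example: three of six concepts dominate the decision.
c = np.array([0.9, 0.0, 0.8, 0.01, 0.7, 0.0])   # concept activations
w = np.array([1.0, 2.0, 1.0, 0.1, 1.0, 0.0])    # final-layer weights, one class
print(number_of_contributing_concepts(c, w))    # 3
```

A decision-level count like this complements dataset-level sparsity measures: it asks how many concepts actually drive each individual prediction, which is what matters for the conciseness of an explanation.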
Problem

Research questions and friction points this paper is trying to address.

Concept Bottleneck Models
interpretability
predictive power
concept learnability
information leakage
Innovation

Methods, ideas, or system contributions that make the work stand out.

Concept Bottleneck Models
Sparse Autoencoders
Mechanistic Interpretability
Multimodal LLM
Sparsity Metric