🤖 AI Summary
Large language models (LLMs) are inherently "black-box" systems, which limits their interpretability, safety, and trustworthiness.
Method: This paper proposes an intrinsically interpretable LLM construction paradigm, introducing the first concept-bottleneck architecture for LLMs. It incorporates concept-constrained fine-tuning, interpretable neuron localization, concept-space projection, and intervention techniques into Transformer-based models to explicitly encode human-understandable semantic concepts. This enables concept-level detection and generation guidance across diverse tasks.
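The core idea of a concept bottleneck can be sketched in a few lines: hidden states are projected into an explicit, human-named concept space, and the final prediction is a linear readout over those concept activations, so every logit decomposes into named-concept contributions and a concept can be manually overwritten (an intervention). This is a minimal illustrative sketch with made-up names and random weights, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: three named concepts, a small hidden size, binary output.
CONCEPTS = ["positive sentiment", "negative sentiment", "mentions price"]
HIDDEN_DIM, NUM_CLASSES = 8, 2

W_concept = rng.normal(size=(len(CONCEPTS), HIDDEN_DIM))  # projection into concept space
W_out = rng.normal(size=(NUM_CLASSES, len(CONCEPTS)))     # linear readout over concepts

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(hidden, intervene=None):
    """Return (concept scores in [0, 1], class logits).

    `intervene` maps concept names to values that overwrite the
    detected scores before the readout -- the concept-level
    intervention mechanism in miniature.
    """
    scores = sigmoid(W_concept @ hidden)
    if intervene:
        for name, value in intervene.items():
            scores[CONCEPTS.index(name)] = value
    return scores, W_out @ scores

h = rng.normal(size=HIDDEN_DIM)  # stand-in for a pooled LLM hidden state
scores, logits = forward(h)
# Clamp one concept to 1.0 and observe how the prediction shifts.
_, steered = forward(h, intervene={"positive sentiment": 1.0})
```

Because the readout is linear in the concept scores, each prediction attributes directly to the named concepts, which is what makes the explanations faithful rather than post-hoc.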
Contribution/Results: The approach achieves text classification performance competitive with state-of-the-art black-box models while providing precise, faithful attributions. For text generation, it supports fine-grained, concept-driven control, substantially improving explanation quality and human-model collaboration. To our knowledge, this is the first work to realize an embedded, intervenable, and general-purpose interpretability mechanism within LLMs.
📝 Abstract
We introduce the Concept Bottleneck Large Language Model (CB-LLM), a pioneering approach to creating inherently interpretable Large Language Models (LLMs). Unlike traditional black-box LLMs that rely on post-hoc interpretation methods with limited neuron function insights, CB-LLM sets a new standard with its built-in interpretability, scalability, and ability to provide clear, accurate explanations. We investigate two essential tasks in the NLP domain: text classification and text generation. In text classification, CB-LLM narrows the performance gap with traditional black-box models and provides clear interpretability. In text generation, we show how interpretable neurons in CB-LLM can be used for concept detection and steering text generation. Our CB-LLMs enable greater interaction between humans and LLMs across a variety of tasks -- a feature notably absent in existing LLMs. Our code is available at https://github.com/Trustworthy-ML-Lab/CB-LLMs.