🤖 AI Summary
Interpreting the internal activations of neural networks has long been constrained by hand-crafted assumptions and the limited scalability of surrogate models.
Method: We propose the first end-to-end trainable interpretability assistant that frames interpretability as a prediction task: a sparse concept encoder, acting as a communication bottleneck, maps internal activations to data-driven, natural-language concepts, while an autoregressive decoder directly predicts model behavior. Our approach employs a two-stage paradigm, self-supervised pretraining followed by instruction fine-tuning, and uses an automatic evaluation metric (the auto-interp score) to measure bottleneck quality.
Results: Experiments demonstrate significant improvements over baselines across diverse tasks, including detection of jailbreaks, secret hints, and implanted latent concepts, as well as inference of latent user attributes. The learned concept representations exhibit strong cross-task generalization, and both bottleneck quality and downstream performance scale consistently with data volume.
📝 Abstract
Interpreting the internal activations of neural networks can produce more faithful explanations of their behavior, but is difficult due to the complex structure of activation space. Existing approaches to scalable interpretability use hand-designed agents that make and test hypotheses about how internal activations relate to external behavior. We propose to instead turn this task into an end-to-end training objective, by training interpretability assistants to accurately predict model behavior from activations through a communication bottleneck. Specifically, an encoder compresses activations to a sparse list of concepts, and a decoder reads this list and answers a natural language question about the model. We show how to pretrain this assistant on large unstructured data, then finetune it to answer questions. The resulting architecture, which we call a Predictive Concept Decoder, enjoys favorable scaling properties: the auto-interp score of the bottleneck concepts improves with data, as does the performance on downstream applications. Specifically, PCDs can detect jailbreaks, secret hints, and implanted latent concepts, and are able to accurately surface latent user attributes.
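The encoder-bottleneck-decoder flow described above can be sketched in miniature. This is an illustrative toy only: the actual Predictive Concept Decoder uses a learned neural encoder and an autoregressive language-model decoder, whereas here the concept names, scores, top-k selection, and string-matching "decoder" are all invented stand-ins that just show how a sparse concept list acts as the communication bottleneck.

```python
def encode_to_concepts(activations, concept_names, k=3):
    """Toy sparse concept encoder: keep only the top-k strongest concepts.

    `activations` stands in for an internal activation vector projected
    onto a concept dictionary; only the names of the k most active
    concepts pass through the communication bottleneck.
    """
    ranked = sorted(range(len(activations)), key=lambda i: -activations[i])
    return [concept_names[i] for i in ranked[:k]]


def decode_answer(concepts, question):
    """Toy decoder: answers a yes/no question from the sparse concept list.

    The real decoder reads natural-language concepts autoregressively;
    here we simply check concept membership for illustration.
    """
    target = question.removeprefix("Is the model thinking about ").rstrip("?")
    return "yes" if target in concepts else "no"


# Hypothetical concept dictionary and activation scores (assumptions).
names = ["jailbreak attempt", "cooking", "user is a doctor", "weather"]
scores = [0.9, 0.1, 0.7, 0.2]

bottleneck = encode_to_concepts(scores, names, k=2)
print(bottleneck)  # ['jailbreak attempt', 'user is a doctor']
print(decode_answer(bottleneck, "Is the model thinking about jailbreak attempt?"))  # yes
print(decode_answer(bottleneck, "Is the model thinking about cooking?"))  # no
```

The key property the sketch preserves is that the decoder never sees the raw activations, only the sparse, human-readable concept list, so prediction accuracy pressures the encoder to surface behaviorally relevant concepts.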