🤖 AI Summary
Interpreting the internal activations of neural networks has long been constrained by hand-crafted assumptions and the limited scalability of surrogate models.
Method: We propose the first end-to-end trainable interpretability assistant that frames interpretability as a prediction task: a sparse concept encoder, acting as a communication bottleneck, maps internal activations to data-driven, natural-language concepts, while an autoregressive decoder directly predicts model behavior. Our approach employs a two-stage paradigm, self-supervised pretraining followed by instruction fine-tuning, and uses an automatic evaluation metric (the auto-interp score) to measure bottleneck quality.
Results: Experiments demonstrate significant improvements over baselines across diverse tasks, including detection of jailbreaks, secret hints, and implanted latent concepts, as well as inference of latent user attributes. The learned concept representations exhibit strong cross-task generalization, and both bottleneck quality and downstream performance scale consistently with data volume.
📝 Abstract
Interpreting the internal activations of neural networks can produce more faithful explanations of their behavior, but is difficult due to the complex structure of activation space. Existing approaches to scalable interpretability use hand-designed agents that make and test hypotheses about how internal activations relate to external behavior. We propose to instead turn this task into an end-to-end training objective, by training interpretability assistants to accurately predict model behavior from activations through a communication bottleneck. Specifically, an encoder compresses activations to a sparse list of concepts, and a decoder reads this list and answers a natural language question about the model. We show how to pretrain this assistant on large unstructured data, then finetune it to answer questions. The resulting architecture, which we call a Predictive Concept Decoder, enjoys favorable scaling properties: the auto-interp score of the bottleneck concepts improves with data, as does the performance on downstream applications. Specifically, PCDs can detect jailbreaks, secret hints, and implanted latent concepts, and are able to accurately surface latent user attributes.
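The encoder-bottleneck-decoder flow described above can be sketched in miniature. This is an illustrative toy only: the actual Predictive Concept Decoder uses a learned neural encoder and an autoregressive language-model decoder, whereas here the concept names, scores, top-k selection, and string-matching "decoder" are all invented stand-ins that just show how a sparse concept list acts as the communication bottleneck.

```python
def encode_to_concepts(activations, concept_names, k=3):
    """Toy sparse concept encoder: keep only the top-k strongest concepts.

    `activations` stands in for an internal activation vector projected
    onto a concept dictionary; only the names of the k most active
    concepts pass through the communication bottleneck.
    """
    ranked = sorted(range(len(activations)), key=lambda i: -activations[i])
    return [concept_names[i] for i in ranked[:k]]


def decode_answer(concepts, question):
    """Toy decoder: answers a yes/no question from the sparse concept list.

    The real decoder reads natural-language concepts autoregressively;
    here we simply check concept membership for illustration.
    """
    target = question.removeprefix("Is the model thinking about ").rstrip("?")
    return "yes" if target in concepts else "no"


# Hypothetical concept dictionary and activation scores (assumptions).
names = ["jailbreak attempt", "cooking", "user is a doctor", "weather"]
scores = [0.9, 0.1, 0.7, 0.2]

bottleneck = encode_to_concepts(scores, names, k=2)
print(bottleneck)  # ['jailbreak attempt', 'user is a doctor']
print(decode_answer(bottleneck, "Is the model thinking about jailbreak attempt?"))  # yes
print(decode_answer(bottleneck, "Is the model thinking about cooking?"))  # no
```

The key property the sketch preserves is that the decoder never sees the raw activations, only the sparse, human-readable concept list, so prediction accuracy pressures the encoder to surface behaviorally relevant concepts.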