🤖 AI Summary
Existing interpretability methods for large language models (LLMs) rely on model-specific training and manual annotation, resulting in high costs, poor generalization, and limited cross-model transferability. To address this, we propose Atlas-Alignment, a framework that transfers interpretability across models without training new sparse autoencoders or collecting additional annotations. Atlas-Alignment uses lightweight representational alignment and feature mapping under shared inputs to project an arbitrary LLM’s latent representations onto a pre-annotated Concept Atlas, enabling cross-model semantic retrieval and concept-guided controllable generation. Empirically, Atlas-Alignment matches the performance of dedicated interpretability models across multiple tasks while drastically reducing the cost of interpreting new LLMs. By decoupling interpretation from model-specific training, it makes interpretability systems more universal, reusable, and scalable.
📝 Abstract
Interpretability is crucial for building safe, reliable, and controllable language models, yet existing interpretability pipelines remain costly and difficult to scale. Interpreting a new model typically requires training model-specific sparse autoencoders (SAEs), manually or semi-automatically labeling the SAE components, and then validating them. We introduce Atlas-Alignment, a framework for transferring interpretability across language models by aligning unknown latent spaces to a Concept Atlas - a labeled, human-interpretable latent space - using only shared inputs and lightweight representational alignment techniques. Once aligned, this enables two key capabilities in previously opaque models: (1) semantic feature search and retrieval, and (2) steering generation along human-interpretable atlas concepts. Through quantitative and qualitative evaluations, we show that simple representational alignment methods enable robust semantic retrieval and steerable generation without the need for labeled concept data. Atlas-Alignment thus amortizes the cost of explainable AI and mechanistic interpretability: by investing in one high-quality Concept Atlas, we can make many new models transparent and controllable at minimal marginal cost.
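The alignment idea described above can be sketched as a simple linear regression between paired activations collected on the same shared inputs. The sketch below is illustrative only: the dimensions, the linear ground-truth map, and the concept labels are synthetic assumptions, not the paper's actual models or data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions for illustration): an "atlas" model whose
# latent space carries labeled concept directions, and a new, opaque model
# observed only through its activations on the SAME shared inputs.
n_inputs, d_new, d_atlas, n_concepts = 400, 48, 32, 5
concept_labels = [f"concept_{i}" for i in range(n_concepts)]
C = rng.standard_normal((n_concepts, d_atlas))   # labeled atlas directions

# Paired activations: simulate the unknown model as a linear transform of
# the atlas space plus a little noise.
H_new = rng.standard_normal((n_inputs, d_new))         # new model's states
W_true = rng.standard_normal((d_new, d_atlas))         # unknown ground truth
H_atlas = H_new @ W_true + 0.01 * rng.standard_normal((n_inputs, d_atlas))

# Lightweight representational alignment: ordinary least squares mapping
# from the new model's latent space into the atlas space.
W, *_ = np.linalg.lstsq(H_new, H_atlas, rcond=None)

def retrieve_concept(h_new: np.ndarray) -> str:
    """Semantic retrieval: project an activation of the new model into the
    atlas space and return the nearest labeled concept (cosine similarity)."""
    z = h_new @ W
    sims = (C @ z) / (np.linalg.norm(C, axis=1) * np.linalg.norm(z) + 1e-9)
    return concept_labels[int(np.argmax(sims))]

# A new-model activation that corresponds to concept 3 in atlas coordinates.
probe = C[3] @ np.linalg.pinv(W_true)
print(retrieve_concept(probe))   # -> concept_3
```

The same learned map also supports a crude form of steering: shift the projected representation `z` along a labeled concept direction and carry that shift back into the new model's space via the pseudo-inverse of `W`.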