🤖 AI Summary
Existing interpretability methods for large language models (LLMs) rely on model-specific training and manual annotation, resulting in high costs, poor generalization, and limited cross-model transferability. To address this, we propose Atlas-Alignment, a framework that transfers interpretability across models without training new sparse autoencoders or collecting additional annotations. Atlas-Alignment uses lightweight representational alignment and feature mapping under shared inputs to project an arbitrary LLM’s latent representations onto a pre-annotated Concept Atlas, enabling cross-model semantic retrieval and concept-guided controllable generation. Empirically, Atlas-Alignment matches the performance of dedicated interpretability models across multiple tasks while drastically reducing the cost of interpreting new LLMs. By decoupling interpretation from model-specific training, it makes interpretability systems more universal, reusable, and scalable.
📝 Abstract
Interpretability is crucial for building safe, reliable, and controllable language models, yet existing interpretability pipelines remain costly and difficult to scale. Interpreting a new model typically requires training model-specific sparse autoencoders (SAEs), manually or semi-automatically labeling the SAE components, and then validating them. We introduce Atlas-Alignment, a framework for transferring interpretability across language models by aligning unknown latent spaces to a Concept Atlas - a labeled, human-interpretable latent space - using only shared inputs and lightweight representational alignment techniques. Once aligned, this enables two key capabilities in previously opaque models: (1) semantic feature search and retrieval, and (2) steering generation along human-interpretable atlas concepts. Through quantitative and qualitative evaluations, we show that simple representational alignment methods enable robust semantic retrieval and steerable generation without the need for labeled concept data. Atlas-Alignment thus amortizes the cost of explainable AI and mechanistic interpretability: by investing in one high-quality Concept Atlas, we can make many new models transparent and controllable at minimal marginal cost.
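The alignment idea described above can be sketched as a simple linear regression between paired activations collected on the same shared inputs. The sketch below is illustrative only: the dimensions, the linear ground-truth map, and the concept labels are synthetic assumptions, not the paper's actual models or data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions for illustration): an "atlas" model whose
# latent space carries labeled concept directions, and a new, opaque model
# observed only through its activations on the SAME shared inputs.
n_inputs, d_new, d_atlas, n_concepts = 400, 48, 32, 5
concept_labels = [f"concept_{i}" for i in range(n_concepts)]
C = rng.standard_normal((n_concepts, d_atlas))   # labeled atlas directions

# Paired activations: simulate the unknown model as a linear transform of
# the atlas space plus a little noise.
H_new = rng.standard_normal((n_inputs, d_new))         # new model's states
W_true = rng.standard_normal((d_new, d_atlas))         # unknown ground truth
H_atlas = H_new @ W_true + 0.01 * rng.standard_normal((n_inputs, d_atlas))

# Lightweight representational alignment: ordinary least squares mapping
# from the new model's latent space into the atlas space.
W, *_ = np.linalg.lstsq(H_new, H_atlas, rcond=None)

def retrieve_concept(h_new: np.ndarray) -> str:
    """Semantic retrieval: project an activation of the new model into the
    atlas space and return the nearest labeled concept (cosine similarity)."""
    z = h_new @ W
    sims = (C @ z) / (np.linalg.norm(C, axis=1) * np.linalg.norm(z) + 1e-9)
    return concept_labels[int(np.argmax(sims))]

# A new-model activation that corresponds to concept 3 in atlas coordinates.
probe = C[3] @ np.linalg.pinv(W_true)
print(retrieve_concept(probe))   # -> concept_3
```

The same learned map also supports a crude form of steering: shift the projected representation `z` along a labeled concept direction and carry that shift back into the new model's space via the pseudo-inverse of `W`.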