Interpretability Transfer from Language to Vision via Sparse Autoencoders

📅 2026-05-24

🤖 AI Summary

This work addresses the challenge of applying sparse autoencoders (SAEs), which have mainly been used to interpret language models, to visual interpretability, where progress is hindered by the difficulty and ambiguity of labeling visual concepts. The authors propose VISTA, a framework that transfers interpretability from language to vision in a LLaVA-style vision-language model by constraining the visual projector to map visual tokens into the labeled SAE space of the LLM, removing the need to train a dedicated vision SAE. Regularizing the projector with the LLM's SAE reconstruction loss yields a threefold increase in the matching rate, a measure of how accurately the most activating textual concepts correspond to semantic elements in the image. Using VISTA, the authors find that DINOv2 features localize spatially better than the other vision encoders compared. In localized concept interventions, VISTA improves on vision-only baselines by 35% on object removal and 47% on object replacement, which the authors present as causal evidence that visual tokens lie on the text SAE manifold.

📝 Abstract

Recent advances in language model interpretability using sparse autoencoders (SAEs) have yet to effectively translate to the visual domain, mainly due to the difficulty and ambiguity of labeling visual concepts. In this paper, we introduce Visual Interpretability via SAE Transfer Alignment (VISTA), a framework that transfers interpretability from language to vision in a LLaVA-style vision-language model by constraining a visual projector to map visual tokens into an LLM's pre-existing, labeled textual SAE space. This approach enables visual interpretability without training dedicated vision SAEs. By regularizing the projector using the LLM's SAE reconstruction loss, VISTA achieves a threefold increase in the matching rate, which measures how accurately the most activating textual concepts in the SAE space correspond to semantic elements in the image. Using this framework, we further analyze spatial localization properties of different vision encoders and show that DINOv2 features have stronger localization abilities than other encoders. Leveraging this precision, we validate VISTA's cross-modal alignment through fine-grained, localized concept interventions, where specific objects are removed or replaced in the model's perception while preserving the surrounding scene. This results in improvements of 35% in object removal and 47% in object replacement tasks over vision-only baselines, providing causal evidence that visual tokens inhabit the text SAE manifold. These contributions are validated across multiple LLM architectures.
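
The regularization idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the linear projector, the ReLU SAE form, the mean-squared-error reconstruction term, and every name below (`sae_reconstruction_loss`, `projector_regularizer`, `top_k_concepts`, `TOY_SAE`, `TOY_LABELS`) are assumptions made for illustration only. The sketch assumes the projector's output tokens are passed through a frozen text SAE, and the reconstruction error is used as a penalty that would be added to the usual training loss with some weight.

```python
# Hypothetical sketch: penalize projected visual tokens that a frozen
# text SAE cannot reconstruct. Pure Python, toy dimensions.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def sae_encode(z, sae):
    """Assumed SAE encoder: ReLU(W_enc z + b_enc)."""
    pre = matvec(sae["W_enc"], z)
    return [max(0.0, a + b) for a, b in zip(pre, sae["b_enc"])]

def sae_decode(f, sae):
    """Assumed SAE decoder: W_dec f + b_dec."""
    out = matvec(sae["W_dec"], f)
    return [a + b for a, b in zip(out, sae["b_dec"])]

def sae_reconstruction_loss(z, sae):
    """Mean squared error between a token and its SAE reconstruction."""
    z_hat = sae_decode(sae_encode(z, sae), sae)
    return sum((a - b) ** 2 for a, b in zip(z, z_hat)) / len(z)

def projector_regularizer(visual_feats, W_proj, sae):
    """Average SAE reconstruction loss over projected visual tokens.

    Only W_proj would be trained; the SAE weights stay frozen.
    """
    tokens = [matvec(W_proj, v) for v in visual_feats]
    return sum(sae_reconstruction_loss(z, sae) for z in tokens) / len(tokens)

def top_k_concepts(z, sae, labels, k=3):
    """Labels of the k most active SAE features for one token."""
    f = sae_encode(z, sae)
    order = sorted(range(len(f)), key=lambda i: f[i], reverse=True)[:k]
    return [labels[i] for i in order]

# Toy 2-dimensional SAE with identity weights (illustrative values only).
TOY_SAE = {
    "W_enc": [[1.0, 0.0], [0.0, 1.0]],
    "b_enc": [0.0, 0.0],
    "W_dec": [[1.0, 0.0], [0.0, 1.0]],
    "b_dec": [0.0, 0.0],
}
TOY_LABELS = ["cat", "dog"]
IDENTITY_PROJ = [[1.0, 0.0], [0.0, 1.0]]
```

In this toy setup, a token with non-negative entries is reconstructed exactly and incurs no penalty, while a token with a negative entry is clipped by the ReLU and is penalized. A matching-rate style check would then compare the labels returned by `top_k_concepts` with the semantic content of the image; how the paper computes that comparison is not specified here.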

Problem

Research questions and friction points this paper is trying to address.

interpretability transfer

sparse autoencoders

vision-language models

visual concept labeling

cross-modal alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Autoencoders

Interpretability Transfer

Vision-Language Models