Can Cross-Layer Transcoders Replace Vision Transformer Activations? An Interpretable Perspective on Vision

📅 2026-04-14
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing sparse autoencoders analyze one layer at a time, so they cannot capture the cross-layer computational structure of Vision Transformers (ViTs) or quantify each layer's contribution to the final representation. This work proposes the Cross-Layer Transcoder (CLT), a depth-aware sparse surrogate for the MLP blocks that uses an encoder-decoder architecture to reconstruct the post-MLP activation at each layer from sparse embeddings of preceding layers. The CLT thereby decomposes the ViT's final representation into an additive, layer-resolved form. Experiments on CIFAR-100, COCO, and ImageNet-100 with CLIP ViT-B/32 and ViT-B/16 show that CLT reconstructs activations with high fidelity while preserving, and in some cases improving, CLIP's zero-shot classification accuracy. The cross-layer contribution scores also show that the final representation is driven mainly by a small set of dominant layer-wise terms: removing them degrades performance, and keeping only them largely preserves it.

📝 Abstract
Understanding the internal activations of Vision Transformers (ViTs) is critical for building interpretable and trustworthy models. Sparse Autoencoders (SAEs) have been used to extract human-interpretable features, but they operate on individual layers, so they capture neither the cross-layer computational structure of Transformers nor the relative importance of each layer in forming the last-layer representation. As an alternative, we adopt Cross-Layer Transcoders (CLTs) as reliable, sparse, and depth-aware proxy models for the MLP blocks in ViTs. CLTs use an encoder-decoder scheme to reconstruct each post-MLP activation from learned sparse embeddings of preceding layers. This yields a linear decomposition that turns the final representation of a ViT from an opaque embedding into an additive, layer-resolved construction, enabling faithful attribution and process-level interpretability. We train CLTs on CLIP ViT-B/32 and ViT-B/16 across CIFAR-100, COCO, and ImageNet-100. CLTs reconstruct post-MLP activations with high fidelity while preserving, and in some cases improving, CLIP zero-shot classification accuracy. In terms of interpretability, the cross-layer contribution scores provide faithful attribution: the final representation is concentrated in a small set of dominant layer-wise terms whose removal degrades performance and whose retention largely preserves it. These results support CLTs as an interpretable proxy for ViTs in the vision domain.
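
The additive, layer-resolved decomposition described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the top-k ReLU sparsity rule, the random weights, the inclusion of the current layer among the "preceding" sources, and the norm-based contribution share are all assumptions made for illustration, and every name (`clt_forward`, `topk_relu`, `dec[(j, l)]`) is hypothetical. The abstract does not specify how the contribution scores are actually computed.

```python
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def topk_relu(v, k):
    """ReLU, then keep only the k largest entries (assumed sparsity rule)."""
    v = [max(0.0, a) for a in v]
    keep = set(sorted(range(len(v)), key=lambda i: v[i], reverse=True)[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(v)]

def rand_matrix(rng, rows, cols):
    """Random Gaussian matrix standing in for trained weights."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def vec_sum(vectors):
    """Element-wise sum of a list of equal-length vectors."""
    return [sum(col) for col in zip(*vectors)]

def norm(v):
    return sum(a * a for a in v) ** 0.5

def clt_forward(xs, enc, dec, k):
    """xs[l]: MLP input at layer l.
    enc[l]: encoder of layer l (n_latent x d).
    dec[(j, l)]: decoder from layer-j latents to the layer-l output (d x n_latent),
                 defined for j <= l (assumption: the current layer is included).
    Returns the sparse latents and, for each target layer l, the per-source-layer
    terms whose sum is the reconstructed post-MLP activation."""
    n = len(xs)
    zs = [topk_relu(matvec(enc[l], xs[l]), k) for l in range(n)]
    terms = [[matvec(dec[(j, l)], zs[j]) for j in range(l + 1)] for l in range(n)]
    return zs, terms

# Toy dimensions; a real ViT-B has 12 layers and a 768-dimensional width.
rng = random.Random(0)
n_layers, d, n_latent, k = 4, 6, 16, 3
xs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_layers)]
enc = [rand_matrix(rng, n_latent, d) for _ in range(n_layers)]
dec = {(j, l): rand_matrix(rng, d, n_latent)
       for l in range(n_layers) for j in range(l + 1)}

zs, terms = clt_forward(xs, enc, dec, k)

# The last-layer reconstruction is an explicit sum of one term per source layer.
final_terms = terms[-1]
final_recon = vec_sum(final_terms)

# Hypothetical contribution score: each term's share of the total term norm.
scores = [norm(t) for t in final_terms]
total = sum(scores) or 1.0
shares = [s / total for s in scores]

# Ablation in the spirit of the abstract: drop the dominant term and rebuild.
top = max(range(n_layers), key=lambda j: shares[j])
ablated = vec_sum([t for j, t in enumerate(final_terms) if j != top])
```

Because the reconstruction is a plain sum, dropping or keeping individual layer terms is a well-defined operation, which is what makes removal and retention experiments of the kind the abstract reports possible.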
Problem

Research questions and friction points this paper is trying to address.

Vision Transformers
Interpretability
Cross-Layer Transcoders
Sparse Autoencoders
Model Transparency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Layer Transcoders
Vision Transformers
Interpretability
Sparse Embeddings
Layer-wise Attribution