🤖 AI Summary
This work identifies task- and attribute-specific specialization among attention heads in the residual stream of vision Transformers, revealing that their spectral geometric structure, captured by low-dimensional principal components, encodes input semantics critical for cross-modal alignment and zero-shot classification in vision-language models. Building on this finding, we propose ResiDual: a parameter-efficient, interpretable alignment method grounded in dual-perspective spectral decomposition of the residual stream. ResiDual establishes, for the first time, a generalizable correlation between head specialization degree and zero-shot performance. Evaluated across 50+ pretrained model-dataset combinations, it achieves fine-tuning-level zero-shot accuracy, significantly improves modality alignment fidelity, and incurs negligible parameter overhead (< 0.1M). The method delivers strong interpretability, via spectral head characterization, and broad generalization across architectures and datasets.
📄 Abstract
When examined through the lens of their residual streams, a puzzling property emerges in transformer networks: residual contributions (e.g., attention heads) sometimes specialize in specific tasks or input attributes. In this paper, we analyze this phenomenon in vision transformers, focusing on the spectral geometry of residuals, and explore its implications for modality alignment in vision-language models. First, we link it to the intrinsically low-dimensional structure of visual head representations, zooming into their principal components and showing that they encode specialized roles across a wide variety of input data distributions. Then, we analyze the effect of head specialization in multimodal models, focusing on how improved alignment between text and specialized heads impacts zero-shot classification performance. This specialization-performance link consistently holds across diverse pre-training data, network sizes, and objectives, demonstrating a powerful new mechanism for boosting zero-shot classification through targeted alignment. Finally, we translate these insights into actionable terms by introducing ResiDual, a technique for spectral alignment of the residual stream. Much like panning for gold, it lets the noise from irrelevant unit principal components (i.e., attributes) wash away to amplify task-relevant ones. Remarkably, this dual perspective on modality alignment yields fine-tuning-level performance on different data distributions while modeling an extremely interpretable and parameter-efficient transformation, as we extensively show on more than 50 (pre-trained network, dataset) pairs.
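The core idea, spectrally reweighting a residual unit's output along its principal components so that task-relevant components are amplified and irrelevant ones are suppressed, can be sketched as follows. This is a minimal illustration under our own assumptions, not the paper's implementation: all names (`head_feats`, `gains`) are hypothetical, the per-component gains would in practice be learned rather than fixed, and the cutoff of 8 components is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for one attention head's outputs over a batch of images:
# 256 samples, 64-dimensional residual contribution (hypothetical sizes).
head_feats = rng.normal(size=(256, 64))

# 1) Spectral decomposition: principal components of the head's outputs.
mean = head_feats.mean(axis=0, keepdims=True)
centered = head_feats - mean
_, _, components = np.linalg.svd(centered, full_matrices=False)  # rows = PCs

# 2) Per-component gains (fixed here for illustration; learned in practice):
#    keep the task-relevant directions, let the rest wash away.
gains = np.ones(components.shape[0])
gains[8:] = 0.0  # retain only the top-8 principal components

# 3) Reweight: project onto the PCs, scale, and project back.
coeffs = centered @ components.T
realigned = (coeffs * gains) @ components + mean
```

Note that the trainable state is just one gain per principal component per unit, which is why such a transformation can stay both parameter-efficient and directly interpretable: each gain indicates how much a spectral direction of that head matters for the task.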