🤖 AI Summary
Current adaptation methods for vision-language models (e.g., CLIP) typically fine-tune unimodal representations while neglecting the critical role of the final fused representation, the rational matrix, in the decision process. To address this, the paper proposes Rational Adaptation (RAda), a lightweight, learnable attention-masking mechanism applied directly at the output layer of pre-trained models. RAda dynamically calibrates the weighted fusion of multimodal representations without modifying encoder architectures or intermediate features. The method is agnostic to encoder training strategy, supporting both frozen and fine-tuned encoders, and integrates into diverse fine-tuning and test-time-training paradigms. Empirical evaluation demonstrates consistent gains over standard baselines across multiple downstream tasks. The implementation is concise, adds minimal parameters, and achieves accuracy competitive with state-of-the-art approaches.
📝 Abstract
Pretrained vision-language models (VLMs), such as CLIP, achieve remarkable zero-shot performance, yet their downstream potential hinges on effective fine-tuning. Most adaptation methods focus on refining representations from separate modalities (text or vision) but neglect the critical role of their fused representation in the decision-making process, i.e., the rational matrix that drives the final prediction. To bridge this gap, we propose a simple yet effective Rational Adaptation (RAda) to explicitly exploit the final fused representation during fine-tuning. RAda employs a learned mask, obtained from a lightweight attention layer attached at the end of a VLM, to dynamically calibrate the contribution of each element in the rational matrix, enabling targeted adjustments to the final cross-modal interactions without incurring costly modifications to intermediate features. Experiments in different settings (i.e., updating or freezing the pretrained encoders during adaptation, and test-time training that can access only unlabeled test data) show that RAda serves as a versatile fine-tuning technique, improving the baseline with minimal code and performing comparably against current arts in most settings. Code is available at https://github.com/khufia/RAda/tree/main.
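To make the idea concrete, here is a minimal NumPy sketch of end-stage calibration over a CLIP-style rational matrix. All shapes, the random features, and the projection `W` are illustrative assumptions, and the paper's lightweight attention layer is simplified here to a single learned sigmoid gate; this is a sketch of the mechanism, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical shapes: B images, C class prompts, d-dim joint embedding.
B, C, d = 4, 10, 8
img = rng.normal(size=(B, d))   # stand-in for image features
txt = rng.normal(size=(C, d))   # stand-in for text (prompt) features

# "Rational matrix": element-wise products whose per-row sums give the
# usual CLIP logits, i.e. logits[b, c] = (img[b] * txt[c]).sum().
rational = img[:, None, :] * txt[None, :, :]   # (B, C, d)

# End-stage calibration: a small learned layer (assumed here to be a
# projection W with a sigmoid gate) emits a mask in (0, 1) that
# re-weights each element of the rational matrix.
W = rng.normal(size=(d, d)) * 0.1              # learnable in practice
mask = sigmoid(rational @ W)                    # (B, C, d)

# Calibrated prediction: masked elements summed back into logits,
# leaving the encoders and intermediate features untouched.
logits = (mask * rational).sum(axis=-1)         # (B, C)
print(logits.shape)
```

Because only `W` is trained, the adjustment stays confined to the final cross-modal interaction, which is what lets the same mechanism plug into frozen-encoder, full fine-tuning, and test-time-training setups.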