RollingQ: Reviving the Cooperation Dynamics in Multimodal Transformer

📅 2025-06-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In multimodal Transformers, heterogeneous modality quality induces degradation of attention dynamic adaptability: models fall into a self-reinforcing modality preference loop, causing cross-modal collaboration failure and imbalanced attention key distributions. To address this, we propose RollingQ—a learnable cyclic shift operation applied to queries—that breaks entrenched modality preferences. RollingQ is the first method to systematically identify and rectify the loss of dynamic adaptability stemming directly from key distribution imbalance. Integrated with modality alignment regularization, it enables parameter-free dynamic recalibration within the standard multi-head attention framework. Evaluated across vision-language and speech-text multimodal tasks, RollingQ achieves an average performance gain of 2.1%. It significantly restores attention’s dynamic responsiveness and reduces modality bias by 73%, demonstrating both theoretical insight and practical efficacy in mitigating modality-induced attention collapse.

Technology Category

Application Category

📝 Abstract
Multimodal learning faces challenges in effectively fusing information from diverse modalities, especially when modality quality varies across samples. Dynamic fusion strategies, such as attention mechanism in Transformers, aim to address such challenge by adaptively emphasizing modalities based on the characteristics of input data. However, through amounts of carefully designed experiments, we surprisingly observed that the dynamic adaptability of widely-used self-attention models diminishes. Model tends to prefer one modality regardless of data characteristics. This bias triggers a self-reinforcing cycle that progressively overemphasizes the favored modality, widening the distribution gap in attention keys across modalities and deactivating attention mechanism's dynamic properties. To revive adaptability, we propose a simple yet effective method Rolling Query (RollingQ), which balances attention allocation by rotating the query to break the self-reinforcing cycle and mitigate the key distribution gap. Extensive experiments on various multimodal scenarios validate the effectiveness of RollingQ and the restoration of cooperation dynamics is pivotal for enhancing the broader capabilities of widely deployed multimodal Transformers. The source code is available at https://github.com/GeWu-Lab/RollingQ_ICML2025.
Problem

Research questions and friction points this paper is trying to address.

Multimodal learning struggles with uneven modality quality
Self-attention models lose adaptability, favoring one modality
RollingQ revives dynamic fusion by balancing attention allocation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Rotating query to balance attention allocation
Breaking self-reinforcing cycle in attention
Mitigating key distribution gap across modalities
🔎 Similar Papers
No similar papers found.
H
Haotian Ni
Beihang University, Beijing, China
Yake Wei
Yake Wei
Renmin University of China
multimodal learning
H
Hang Liu
Xiamen University, Xiamen, China
Gong Chen
Gong Chen
Nanjing University
Magnetic imaging
Chong Peng
Chong Peng
Qingdao University
机器学习、计算机视觉
H
Hao Lin
Tencent, Shenzhen, China
D
Di Hu
Gaoling School of Artificial Intelligence Renmin University of China, Beijing, China; Beijing Key Laboratory of Research on Large Models and Intelligent Governance, Beijing, China; Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE, Beijing, China