MODA: MOdular Duplex Attention for Multimodal Perception, Cognition, and Emotion Understanding

📅 2025-07-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Contemporary multimodal large language models (MLLMs) suffer from inconsistent cross-modal attention and progressive layer-wise attention decay, hindering fine-grained perception, cognition, and affective understanding in advanced multimodal tasks. To address these limitations, we propose MOdular Duplex Attention (MODA), a novel attention mechanism featuring three key innovations: (1) a correct-after-align strategy that decouples modality alignment from cross-layer token mixing; (2) adaptive masked attention enabling modality-specific interaction patterns; and (3) duplex modality spaces constructed via basis-vector mapping, enabling interaction between the visual and language modalities. MODA preserves semantic fidelity while enhancing cross-modal coherence across layers. We comprehensively evaluate MODA on 21 diverse multimodal benchmarks spanning perception, cognition, and emotion understanding, demonstrating consistent and significant improvements over state-of-the-art MLLMs. Source code and an interactive demo are publicly released.

📝 Abstract
Multimodal large language models (MLLMs) have recently shown strong capacity for integrating data across multiple modalities, empowered by a generalizable attention architecture. Advanced methods predominantly focus on language-centric tuning while paying less attention to how multimodal tokens are mixed through attention, posing challenges for high-level tasks that require fine-grained cognition and emotion understanding. In this work, we identify the attention deficit disorder problem in multimodal learning, caused by inconsistent cross-modal attention and layer-by-layer decayed attention activation. To address this, we propose a novel attention mechanism, termed MOdular Duplex Attention (MODA), which simultaneously conducts inner-modal refinement and inter-modal interaction. MODA employs a correct-after-align strategy to effectively decouple modality alignment from cross-layer token mixing. In the alignment phase, tokens are mapped to duplex modality spaces based on the basis vectors, enabling interaction between the visual and language modalities. Further, the correctness of attention scores is ensured through adaptive masked attention, which enhances the model's flexibility by allowing customizable masking patterns for different modalities. Extensive experiments on 21 benchmark datasets verify the effectiveness of MODA in perception, cognition, and emotion tasks. Source code and demo are available at https://zzcheng.top/MODA.
Problem

Research questions and friction points this paper is trying to address.

Addresses inconsistent cross-modal attention in MLLMs
Solves layer-by-layer decayed attention activation issue
Enhances fine-grained cognition and emotion understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

MODA: Modular Duplex Attention mechanism
Correct-after-align strategy for modality alignment
Adaptive masked attention for flexible masking
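The adaptive masked attention idea above, where each modality pair gets its own customizable masking pattern, can be sketched as a toy single-head attention in NumPy. This is an illustrative sketch only: the function name, the 2×2 mask table, and the vision/text ID convention are our assumptions, not the paper's released code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def modality_masked_attention(q, k, v, modality_ids, mask_table):
    """Single-head attention where the mask for each query/key pair is
    looked up from the modalities of the two tokens (0 = vision, 1 = text).
    mask_table[i, j] is 0.0 to let modality-i queries attend to modality-j
    keys, or -inf to block that interaction pattern."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    # Broadcast the per-modality-pair mask over all token pairs.
    scores = scores + mask_table[modality_ids[:, None], modality_ids[None, :]]
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

# Example: 2 vision tokens followed by 2 text tokens.
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((4, 8))
v = rng.standard_normal((4, 8))
modality_ids = np.array([0, 0, 1, 1])
# Block text queries from attending to vision keys; allow everything else.
mask_table = np.array([[0.0, 0.0],
                       [-np.inf, 0.0]])
out, weights = modality_masked_attention(q, k, v, modality_ids, mask_table)
```

Swapping in a different `mask_table` changes the interaction pattern without touching the attention computation itself, which is the flexibility the abstract attributes to adaptive masking.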
Zhicheng Zhang
Carnegie Mellon University
Reinforcement Learning · Explainable RL
Wuyou Xia
VCIP & TMCC & DISSec, College of Computer Science, Nankai University
Chenxi Zhao
VCIP & TMCC & DISSec, College of Computer Science, Nankai University
Zhou Yan
Kuaishou Technology
Xiaoqiang Liu
Kuaishou Technology
Yongjie Zhu
Kuaishou Technology
Wenyu Qin
Harbin Institute of Technology
Control
Pengfei Wan
Head of Kling Video Generation Models, Kuaishou Technology
Generative Models · Computer Vision · Multimodal AI · Computer Graphics
Di Zhang
Kuaishou Technology
Jufeng Yang
Nankai University
Computer vision · Machine learning · Multimedia