MAGE: Multimodal Alignment and Generation Enhancement via Bridging Visual and Semantic Spaces

📅 2025-07-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the dual loss of spatial structure and semantic information after visual encoding in multimodal models—leading to insufficient coupling between visual encoders and large language models—this paper proposes the Intelligent Alignment Network (IAN). IAN employs learnable cross-modal mappings to achieve simultaneous alignment in both dimensionality and semantics. We further design a multi-task loss function that jointly optimizes cross-entropy and mean squared error, mitigating representational discrepancies among semantically equivalent but structurally heterogeneous data. Additionally, we construct a multimodal tool-use instruction-tuning dataset to enhance Any-to-Any generation capability. Extensive experiments on benchmarks including MME, MMBench, and SEED demonstrate significant improvements over state-of-the-art methods, validating the efficacy of our cross-modal alignment strategy. The code and supplementary materials are publicly available.

📝 Abstract
In the latest advancements in multimodal learning, effectively addressing the spatial and semantic losses that visual data suffers after encoding remains a critical challenge, because the performance of large multimodal models is positively correlated with the coupling between visual encoders and large language models. Existing approaches often suffer from vector-space gaps or semantic disparities, causing information loss during propagation. To address these issues, we propose MAGE (Multimodal Alignment and Generation Enhancement), a novel framework that bridges the semantic spaces of vision and text through an innovative alignment mechanism. By introducing the Intelligent Alignment Network (IAN), MAGE achieves both dimensional and semantic alignment. To reduce the gap between synonymous but structurally heterogeneous data, we employ a training strategy that combines cross-entropy and mean squared error, significantly enhancing the alignment effect. Moreover, to strengthen MAGE's "Any-to-Any" capability, we construct a fine-tuning dataset of multimodal tool-calling instructions that expands the model's output capability boundaries. Across evaluation benchmarks including MME, MMBench, and SEED, MAGE achieves significantly better performance than comparable works. Complete code and appendix are available at: https://github.com/GTCOM-NLP/MAGE.
Problem

Research questions and friction points this paper is trying to address.

Addressing spatial and semantic losses in visual data encoding
Bridging semantic gaps between vision and text modalities
Enhancing multimodal model performance via alignment and generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bridges visual and text semantic spaces
Uses Intelligent Alignment Network (IAN)
Combines cross-entropy and MSE training
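The combined training objective listed above can be sketched as follows. This is a minimal illustration under assumed shapes and names — the paper's actual IAN architecture and loss weighting are not given here, so the projection matrix `W`, the weight `lam`, and the function names are all hypothetical, and NumPy stands in for a deep-learning framework:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the class axis.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_entropy(logits, labels):
    # Mean negative log-likelihood of the gold labels.
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def mse(pred, target):
    return np.mean((pred - target) ** 2)

def combined_alignment_loss(visual_feats, W, text_embeds, logits, labels, lam=0.5):
    """Hypothetical CE + MSE alignment objective.

    visual_feats: (batch, d_vis) encoder outputs
    W:            (d_vis, d_text) learnable projection (dimensional alignment)
    text_embeds:  (batch, d_text) paired text embeddings (semantic target)
    logits/labels: language-modeling head outputs and gold tokens
    """
    # Project visual features into the text embedding space.
    projected = visual_feats @ W
    # CE supervises token prediction; MSE pulls projected visual features
    # toward their paired text embeddings.
    return cross_entropy(logits, labels) + lam * mse(projected, text_embeds)
```

In this reading, the MSE term closes the gap between semantically equivalent but structurally heterogeneous representations, while cross-entropy preserves the generation objective; `lam` balances the two and would be tuned in practice.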
Shaojun E
Global Tone Communication Technology Co., Ltd., Beijing, China; School of Computer Science and Technology, Beijing Jiaotong University, Beijing, China
Yuchen Yang
Faculty of Computing, Harbin Institute of Technology, Harbin, China
Jiaheng Wu
Faculty of Computing, Harbin Institute of Technology, Harbin, China
Yan Zhang
Global Tone Communication Technology Co., Ltd., Beijing, China
Tiejun Zhao
Faculty of Computing, Harbin Institute of Technology, Harbin, China
Ziyan Chen
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Generative AI · Low-Level Vision