SwimVG: Step-wise Multimodal Fusion and Adaption for Visual Grounding

📅 2025-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language grounding methods suffer from insufficient cross-modal alignment and high computational overhead. To address these challenges, we propose a stepwise multimodal fusion and adaptation framework. Our approach introduces two key innovations: (1) Stepwise Multimodal Prompts (Swip), the first of its kind, enabling hierarchical and complementary alignment between visual and linguistic features—from shallow to deep layers, and at both token-level and weight-level; and (2) a Cross-Modal Interaction Adapter (CIA), integrating parameter-efficient fine-tuning with a lightweight architecture to drastically reduce computation. Evaluated on four standard benchmarks—RefCOCO, RefCOCO+, RefCLEF, and G-Ref—our method achieves state-of-the-art performance. It improves inference speed by 37% and reduces FLOPs by 42%, striking an optimal trade-off between accuracy and efficiency.

📝 Abstract
Visual grounding aims to localize an image region described by natural language, which relies heavily on cross-modal alignment. Most existing methods transfer visual and linguistic knowledge separately by fully fine-tuning uni-modal pre-trained models, followed by a simple stack of vision-language transformers for multimodal fusion. However, these approaches not only limit adequate interaction between visual and linguistic contexts, but also incur significant computational costs. To address these issues, we explore a step-wise multimodal fusion and adaption framework, namely SwimVG. Specifically, SwimVG proposes step-wise multimodal prompts (Swip) and cross-modal interactive adapters (CIA) for visual grounding, replacing the cumbersome transformer stacks used for multimodal fusion. Swip improves the alignment between vision and language representations step by step, in a token-level fusion manner. In addition, weight-level CIA further promotes multimodal fusion through cross-modal interaction. Swip and CIA are both parameter-efficient paradigms, and they fuse cross-modal features gradually from shallow to deep layers. Experimental results on four widely-used benchmarks demonstrate that SwimVG achieves remarkable accuracy and considerable gains in efficiency. Our code is available at https://github.com/liuting20/SwimVG.
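The abstract describes a weight-level cross-modal interactive adapter: a lightweight residual module that injects a text-conditioned signal into visual features instead of a full transformer fusion stack. Below is a minimal pure-Python sketch of that general bottleneck-adapter idea; the function names, dimensions, and weights are all hypothetical illustrations, not the paper's actual implementation.

```python
# Illustrative sketch of a cross-modal bottleneck adapter
# (all names and dimensions are hypothetical, not from the paper).

def matvec(W, x):
    """Multiply matrix W (rows x cols) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(x):
    return [max(0.0, v) for v in x]

def cross_modal_adapter(visual, text, W_down, W_text, W_up):
    """Residual adapter: down-project the visual token, inject a
    text-conditioned signal, up-project, and add the residual."""
    h = matvec(W_down, visual)                  # down-project visual token
    t = matvec(W_text, text)                    # map text feature to bottleneck width
    h = relu([a + b for a, b in zip(h, t)])     # cross-modal interaction
    out = matvec(W_up, h)                       # up-project back to visual width
    return [v + o for v, o in zip(visual, out)] # residual connection

# Toy example: 4-dim visual token, 3-dim text feature, bottleneck width 2.
visual = [1.0, 0.0, -1.0, 0.5]
text = [0.5, 0.5, 0.0]
W_down = [[1, 0, 0, 0], [0, 1, 0, 0]]   # 2x4
W_text = [[1, 0, 0], [0, 1, 0]]         # 2x3
W_up = [[1, 0], [0, 1], [0, 0], [0, 0]] # 4x2
fused = cross_modal_adapter(visual, text, W_down, W_text, W_up)
# fused == [2.5, 0.5, -1.0, 0.5]
```

Because only the small adapter matrices are trained while the backbone stays frozen, this style of module is parameter-efficient; applying one such adapter per backbone layer would give the shallow-to-deep, step-wise fusion the abstract describes.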
Problem

Research questions and friction points this paper is trying to address.

Enhances visual-language alignment efficiently
Reduces computational costs in multimodal fusion
Improves cross-modal interaction step-wise
Innovation

Methods, ideas, or system contributions that make the work stand out.

Step-wise multimodal prompts
Cross-modal interactive adapters
Parameter-efficient fusion paradigms
Liangtao Shi
Key Laboratory of Knowledge Engineering with Big Data, Hefei University of Technology, Hefei 230009, China, and also with the Ministry of Education and School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China
Ting Liu
School of Systems Engineering, National University of Defense Technology, Changsha, Hunan Province, 410073, China
Xiantao Hu
Nanjing University of Science & Technology
Computer Vision
Yue Hu
School of Systems Engineering, National University of Defense Technology, Changsha, Hunan Province, 410073, China
Quanjun Yin
School of Systems Engineering, National University of Defense Technology, Changsha, Hunan Province, 410073, China
Richang Hong
Hefei University of Technology
Multimedia, Pattern Recognition