SwimVG: Step-wise Multimodal Fusion and Adaption for Visual Grounding

📅 2025-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language grounding methods suffer from insufficient cross-modal alignment and high computational overhead. To address these challenges, we propose a stepwise multimodal fusion and adaptation framework. Our approach introduces two key innovations: (1) Stepwise Multimodal Prompts (Swip), the first of its kind, enabling hierarchical and complementary alignment between visual and linguistic features—from shallow to deep layers, and at both token-level and weight-level; and (2) a Cross-Modal Interaction Adapter (CIA), integrating parameter-efficient fine-tuning with a lightweight architecture to drastically reduce computation. Evaluated on four standard benchmarks—RefCOCO, RefCOCO+, RefCLEF, and G-Ref—our method achieves state-of-the-art performance. It improves inference speed by 37% and reduces FLOPs by 42%, striking an optimal trade-off between accuracy and efficiency.

📝 Abstract
Visual grounding aims to localize an image region described by natural language, which relies heavily on cross-modal alignment. Most existing methods transfer visual and linguistic knowledge separately by fully fine-tuning uni-modal pre-trained models, followed by a simple stack of vision-language transformers for multimodal fusion. However, these approaches not only limit adequate interaction between visual and linguistic contexts, but also incur significant computational costs. To address these issues, we explore a step-wise multimodal fusion and adaption framework, namely SwimVG. Specifically, SwimVG proposes step-wise multimodal prompts (Swip) and cross-modal interactive adapters (CIA) for visual grounding, replacing the cumbersome transformer stacks used for multimodal fusion. Swip improves the alignment between vision and language representations step by step, in a token-level fusion manner. In addition, weight-level CIA further promotes multimodal fusion through cross-modal interaction. Swip and CIA are both parameter-efficient paradigms, and they fuse cross-modal features gradually from shallow to deep layers. Experimental results on four widely-used benchmarks demonstrate that SwimVG achieves remarkable accuracy and considerable gains in efficiency. Our code is available at https://github.com/liuting20/SwimVG.
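The abstract describes a weight-level cross-modal interactive adapter: a lightweight residual module that injects a text-conditioned signal into visual features instead of a full transformer fusion stack. Below is a minimal pure-Python sketch of that general bottleneck-adapter idea; the function names, dimensions, and weights are all hypothetical illustrations, not the paper's actual implementation.

```python
# Illustrative sketch of a cross-modal bottleneck adapter
# (all names and dimensions are hypothetical, not from the paper).

def matvec(W, x):
    """Multiply matrix W (rows x cols) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(x):
    return [max(0.0, v) for v in x]

def cross_modal_adapter(visual, text, W_down, W_text, W_up):
    """Residual adapter: down-project the visual token, inject a
    text-conditioned signal, up-project, and add the residual."""
    h = matvec(W_down, visual)                  # down-project visual token
    t = matvec(W_text, text)                    # map text feature to bottleneck width
    h = relu([a + b for a, b in zip(h, t)])     # cross-modal interaction
    out = matvec(W_up, h)                       # up-project back to visual width
    return [v + o for v, o in zip(visual, out)] # residual connection

# Toy example: 4-dim visual token, 3-dim text feature, bottleneck width 2.
visual = [1.0, 0.0, -1.0, 0.5]
text = [0.5, 0.5, 0.0]
W_down = [[1, 0, 0, 0], [0, 1, 0, 0]]   # 2x4
W_text = [[1, 0, 0], [0, 1, 0]]         # 2x3
W_up = [[1, 0], [0, 1], [0, 0], [0, 0]] # 4x2
fused = cross_modal_adapter(visual, text, W_down, W_text, W_up)
# fused == [2.5, 0.5, -1.0, 0.5]
```

Because only the small adapter matrices are trained while the backbone stays frozen, this style of module is parameter-efficient; applying one such adapter per backbone layer would give the shallow-to-deep, step-wise fusion the abstract describes.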
Problem

Research questions and friction points this paper is trying to address.

Enhances visual-language alignment efficiently
Reduces computational costs in multimodal fusion
Improves cross-modal interaction step-wise
Innovation

Methods, ideas, or system contributions that make the work stand out.

Step-wise multimodal prompts
Cross-modal interactive adapters
Parameter-efficient fusion paradigms
Liangtao Shi
Key Laboratory of Knowledge Engineering with Big Data, Hefei University of Technology, Hefei 230009, China, and also with the Ministry of Education and School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China
Ting Liu
School of Systems Engineering, National University of Defense Technology, Changsha, Hunan Province, 410073, China
Xiantao Hu
Nanjing University of Science & Technology
Computer Vision
Yue Hu
School of Systems Engineering, National University of Defense Technology, Changsha, Hunan Province, 410073, China
Quanjun Yin
School of Systems Engineering, National University of Defense Technology, Changsha, Hunan Province, 410073, China
Richang Hong
Hefei University of Technology
Multimedia, Pattern Recognition