MoDA: Modulation Adapter for Fine-Grained Visual Grounding in Instructional MLLMs

πŸ“… 2025-06-02
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing methods struggle to precisely localize fine-grained visual concepts in complex scenes. To address this, we propose MoDA, a lightweight modulation adapter that introduces, for the first time, an instruction-guided, dimension-level dynamic modulation mechanism for visual features. Using Transformer cross-attention over features from a frozen visual encoder, MoDA generates semantics-aware modulation masks that enrich the aligned visual features with fine-grained semantics. Crucially, MoDA requires no modification to the backbone model and is plug-and-play. Integrated into the two-stage LLaVA training paradigm, MoDA significantly improves fine-grained grounding accuracy and contextual consistency on both instruction-following and visual localization tasks, achieving state-of-the-art performance on the RefCOCO, RefCOCO+, RefCOCOg, and VQAv2 benchmarks.

πŸ“ Abstract
Recently, Multimodal Large Language Models (MLLMs) have demonstrated impressive performance on instruction-following tasks by integrating pretrained visual encoders with large language models (LLMs). However, existing approaches often struggle to ground fine-grained visual concepts in complex scenes. In this paper, we propose MoDA (Modulation Adapter), a lightweight yet effective module designed to refine pre-aligned visual features through instruction-guided modulation. Our approach follows the standard LLaVA training protocol, consisting of a two-stage process: (1) aligning image features to the LLM's input space via a frozen vision encoder and adapter layers, and (2) refining those features with the MoDA adapter during the instruction-tuning stage. MoDA employs a Transformer-based cross-attention mechanism to generate a modulation mask over the aligned visual tokens, thereby emphasizing semantically relevant embedding dimensions based on the language instruction. The modulated features are then passed to the LLM for autoregressive language generation. Our experimental evaluation shows that MoDA improves visual grounding and generates more contextually appropriate responses, demonstrating its effectiveness as a general-purpose enhancement for image-based MLLMs.
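The mechanism described in the abstract can be sketched in a few lines: visual tokens attend to the instruction tokens via cross-attention, and the resulting context is projected through a sigmoid to produce a per-dimension mask that rescales each visual feature. This is a minimal NumPy illustration; the single attention head, weight shapes, and function names are assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def moda_modulate(visual_tokens, instr_tokens, W_q, W_k, W_v, W_m):
    """Instruction-guided, dimension-level modulation (illustrative sketch).

    visual_tokens: (N_v, d) aligned visual features (post adapter layers)
    instr_tokens:  (N_t, d) language-instruction embeddings
    Returns modulated visual tokens of shape (N_v, d).
    """
    Q = visual_tokens @ W_q              # queries come from visual tokens
    K = instr_tokens @ W_k               # keys come from the instruction
    V = instr_tokens @ W_v               # values come from the instruction
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # (N_v, N_t) cross-attention
    ctx = attn @ V                       # instruction context per visual token
    mask = sigmoid(ctx @ W_m)            # (N_v, d) mask in (0, 1) per dimension
    return visual_tokens * mask          # emphasize instruction-relevant dims

# Toy usage with random weights (hypothetical shapes)
rng = np.random.default_rng(0)
d = 16
vis = rng.standard_normal((8, d))        # 8 visual tokens
txt = rng.standard_normal((5, d))        # 5 instruction tokens
Ws = [rng.standard_normal((d, d)) * 0.1 for _ in range(4)]
out = moda_modulate(vis, txt, *Ws)
print(out.shape)                         # (8, 16)
```

Because the mask lies in (0, 1), modulation can only attenuate or preserve each embedding dimension, never amplify it, which matches the "emphasize relevant dimensions" reading: irrelevant dimensions are suppressed relative to relevant ones.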
Problem

Research questions and friction points this paper is trying to address.

Enhance fine-grained visual grounding in MLLMs
Refine pre-aligned visual features via modulation
Improve instruction-guided multimodal response accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight Modulation Adapter (MoDA) refines features
Transformer-based cross-attention generates modulation mask
Instruction-guided modulation enhances visual grounding
πŸ”Ž Similar Papers
No similar papers found.