🤖 AI Summary
Existing visual grounding methods face two key bottlenecks: slow, hallucination-prone autoregressive MLLM decoding, and degradation of the pre-trained LLM's reasoning ability when it is fine-tuned for grounding. This paper proposes VGent, a modular encoder-decoder architecture that decouples high-level reasoning from low-level localization: a frozen multimodal large language model (MLLM) serves as a fixed reasoning encoder, while a lightweight decoder takes detector-proposed boxes as queries and selects the target box(es) via cross-modal attention. The authors further introduce QuadThinker, a reinforcement-learning training paradigm that strengthens multi-target reasoning, together with mask-aware label supervision and a global target recognition mechanism. On multi-target visual grounding benchmarks, the method achieves an absolute improvement of 20.6% in F1 over prior methods, and further gains of 8.2% gIoU and 5.8% cIoU under visual reference challenges, while inference latency stays constant and fast.
📝 Abstract
Current visual grounding models either rely on a Multimodal Large Language Model (MLLM) performing auto-regressive decoding, which is slow and prone to hallucinations, or re-align an LLM with vision features to learn new special or object tokens for grounding, which can undermine the LLM's pretrained reasoning ability. In contrast, we propose VGent, a modular encoder-decoder architecture that explicitly disentangles high-level reasoning from low-level bounding-box prediction. Specifically, a frozen MLLM serves as the encoder, providing its powerful reasoning capabilities untouched, while a decoder takes high-quality boxes proposed by detectors as queries and selects the target box(es) via cross-attention over the encoder's hidden states. This design fully leverages advances in both object detection and MLLMs, avoids the pitfalls of auto-regressive decoding, and enables fast inference. Moreover, it supports modular upgrades of both the encoder and the decoder to benefit the whole system: we introduce (i) QuadThinker, an RL-based training paradigm that enhances the encoder's multi-target reasoning ability; (ii) mask-aware labels that resolve detection-segmentation ambiguity; and (iii) global target recognition, which improves recognition of all targets and thereby benefits selection among augmented proposals. Experiments on multi-target visual grounding benchmarks show that VGent achieves a new state of the art with a +20.6% F1 improvement over prior methods, and further boosts gIoU by +8.2% and cIoU by +5.8% under visual reference challenges, while maintaining constant, fast inference latency.
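The decoder described above can be sketched as a small cross-attention module: detector box proposals become queries, the frozen MLLM's hidden states become keys and values, and each proposal receives a selection score in a single forward pass (which is why latency stays constant, unlike token-by-token autoregressive decoding). This is a minimal illustrative sketch, not the paper's actual implementation; the layer sizes, the linear box embedding, and the `BoxSelectionDecoder` name are all assumptions.

```python
import torch
import torch.nn as nn

class BoxSelectionDecoder(nn.Module):
    """Hypothetical sketch of a box-query decoder in the spirit of VGent:
    detector proposals cross-attend to frozen-MLLM hidden states and each
    proposal gets a selection logit. Dimensions are illustrative assumptions."""

    def __init__(self, hidden_dim=256, enc_dim=1024, num_heads=8):
        super().__init__()
        self.box_embed = nn.Linear(4, hidden_dim)       # embed (x1, y1, x2, y2) proposals as queries
        self.enc_proj = nn.Linear(enc_dim, hidden_dim)  # project MLLM hidden states to decoder width
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.score_head = nn.Linear(hidden_dim, 1)      # per-proposal selection logit

    def forward(self, boxes, enc_hidden):
        # boxes: (B, N, 4) detector proposals; enc_hidden: (B, T, enc_dim) frozen encoder states
        q = self.box_embed(boxes)
        kv = self.enc_proj(enc_hidden)
        attended, _ = self.cross_attn(q, kv, kv)        # proposals query the reasoning context
        return self.score_head(attended).squeeze(-1)    # (B, N) logits; thresholding picks target box(es)
```

Because every proposal is scored in parallel, the cost of one forward pass does not grow with the number of targets, in contrast to emitting one box per autoregressive step.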