🤖 AI Summary
Visual grounding typically relies on large-scale annotated datasets and task-specific fine-tuning, limiting generalization across domains. This paper introduces a training-free, proxy-based visual grounding framework that achieves precise zero-shot text-to-image region alignment via joint semantic-spatial reasoning. The method integrates an open-vocabulary object detector, a multimodal large language model (MLLM), and a pure language model, employing an iterative candidate-region refinement strategy that delivers both high accuracy and strong interpretability. On the RefCOCO benchmark suite, it achieves 65.1% average zero-shot grounding accuracy and 89.7% selection-stage accuracy, on par with supervised methods. The core contribution is a fully fine-tuning-free visual grounding approach that eliminates dependencies on labeled data and task-specific adaptation, thereby significantly enhancing cross-distribution generalization.
📄 Abstract
Visual grounding, the task of linking textual queries to specific regions within images, plays a pivotal role in vision-language integration. Existing methods typically rely on extensive task-specific annotations and fine-tuning, limiting their ability to generalize effectively to novel or out-of-distribution scenarios. To address these limitations, we introduce GroundingAgent, a novel agentic visual grounding framework that operates without any task-specific fine-tuning. GroundingAgent employs a structured, iterative reasoning mechanism that integrates pretrained open-vocabulary object detectors, multimodal large language models (MLLMs), and large language models (LLMs) to progressively refine candidate regions through joint semantic and spatial analyses. Remarkably, GroundingAgent achieves an average zero-shot grounding accuracy of 65.1% on the widely used RefCOCO, RefCOCO+, and RefCOCOg benchmarks, entirely without fine-tuning. Furthermore, by substituting MLLM-generated captions with the original query texts, the accuracy at the selection stage alone reaches approximately 90%, closely matching supervised performance and underscoring the critical role of LLM reasoning capabilities. GroundingAgent also offers strong interpretability, transparently illustrating each reasoning step and providing clear insights into its decision-making process.
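The detect-caption-select loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names (`grounding_agent`, `caption_fn`, `score_fn`) and the halving-based refinement schedule are assumptions; in the real framework the captioner would be an MLLM and the scorer an LLM performing joint semantic-spatial reasoning over the candidates.

```python
def grounding_agent(query, regions, caption_fn, score_fn, max_rounds=3):
    """Iteratively refine candidate regions (hypothetical sketch):
    1. caption each candidate region (MLLM stand-in),
    2. score each caption against the query (LLM stand-in),
    3. keep the top half of candidates and repeat until one remains.
    """
    candidates = list(regions)
    for _ in range(max_rounds):
        if len(candidates) <= 1:
            break
        scored = [(score_fn(query, caption_fn(r)), r) for r in candidates]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        candidates = [r for _, r in scored[: max(1, len(candidates) // 2)]]
    return candidates[0]


# Toy stand-ins: regions come pre-captioned; the "LLM" scores word overlap.
regions = [
    {"box": (10, 40, 90, 120), "caption": "a red mug on the left side of the table"},
    {"box": (200, 60, 260, 110), "caption": "a blue plate"},
    {"box": (300, 40, 370, 120), "caption": "a red mug on the right"},
]
caption_fn = lambda r: r["caption"]
score_fn = lambda q, c: len(set(q.lower().split()) & set(c.lower().split()))

best = grounding_agent("the red mug on the left", regions, caption_fn, score_fn)
```

With the toy scorer, the first region wins because its caption shares the most words with the query; the real system replaces this heuristic with LLM reasoning over semantics and spatial relations.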