🤖 AI Summary
Visual grounding typically relies on large-scale annotated datasets and task-specific fine-tuning, limiting generalization across domains. This paper introduces a training-free, proxy-based visual grounding framework that achieves precise zero-shot text-to-image region alignment via joint semantic-spatial reasoning. The method integrates an open-vocabulary object detector, a multimodal large language model (MLLM), and a pure language model, employing an iterative candidate-region refinement strategy that delivers both high accuracy and strong interpretability. On the RefCOCO benchmark suite, it achieves 65.1% average zero-shot grounding accuracy and 89.7% selection-stage accuracy, on par with supervised methods. The core contribution is a fully fine-tuning-free visual grounding approach that eliminates dependencies on labeled data and task-specific adaptation, thereby significantly enhancing cross-distribution generalization.
📄 Abstract
Visual grounding, the task of linking textual queries to specific regions within images, plays a pivotal role in vision-language integration. Existing methods typically rely on extensive task-specific annotations and fine-tuning, limiting their ability to generalize effectively to novel or out-of-distribution scenarios. To address these limitations, we introduce GroundingAgent, a novel agentic visual grounding framework that operates without any task-specific fine-tuning. GroundingAgent employs a structured, iterative reasoning mechanism that integrates pretrained open-vocabulary object detectors, multimodal large language models (MLLMs), and large language models (LLMs) to progressively refine candidate regions through joint semantic and spatial analyses. Remarkably, GroundingAgent achieves an average zero-shot grounding accuracy of 65.1% on the widely used RefCOCO, RefCOCO+, and RefCOCOg benchmarks, entirely without fine-tuning. Furthermore, by substituting MLLM-generated captions with the original query texts, the accuracy at the selection stage alone reaches approximately 90%, closely matching supervised performance and underscoring the critical role of LLM reasoning capabilities. GroundingAgent also offers strong interpretability, transparently illustrating each reasoning step and providing clear insights into its decision-making process.
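The detect-caption-select loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names (`grounding_agent`, `caption_fn`, `score_fn`) and the halving-based refinement schedule are assumptions; in the real framework the captioner would be an MLLM and the scorer an LLM performing joint semantic-spatial reasoning over the candidates.

```python
def grounding_agent(query, regions, caption_fn, score_fn, max_rounds=3):
    """Iteratively refine candidate regions (hypothetical sketch):
    1. caption each candidate region (MLLM stand-in),
    2. score each caption against the query (LLM stand-in),
    3. keep the top half of candidates and repeat until one remains.
    """
    candidates = list(regions)
    for _ in range(max_rounds):
        if len(candidates) <= 1:
            break
        scored = [(score_fn(query, caption_fn(r)), r) for r in candidates]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        candidates = [r for _, r in scored[: max(1, len(candidates) // 2)]]
    return candidates[0]


# Toy stand-ins: regions come pre-captioned; the "LLM" scores word overlap.
regions = [
    {"box": (10, 40, 90, 120), "caption": "a red mug on the left side of the table"},
    {"box": (200, 60, 260, 110), "caption": "a blue plate"},
    {"box": (300, 40, 370, 120), "caption": "a red mug on the right"},
]
caption_fn = lambda r: r["caption"]
score_fn = lambda q, c: len(set(q.lower().split()) & set(c.lower().split()))

best = grounding_agent("the red mug on the left", regions, caption_fn, score_fn)
```

With the toy scorer, the first region wins because its caption shares the most words with the query; the real system replaces this heuristic with LLM reasoning over semantics and spatial relations.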