Connecting the Dots: Training-Free Visual Grounding via Agentic Reasoning

πŸ“… 2025-11-23
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Visual grounding typically relies on large-scale annotated datasets and task-specific fine-tuning, which limits generalization across domains. This paper introduces GroundingAgent, presented as the first training-free, agentic visual grounding framework, which achieves zero-shot text-to-image region alignment through joint semantic and spatial reasoning. The method integrates an open-vocabulary object detector, a multimodal large language model (MLLM), and a text-only large language model (LLM), applying an iterative candidate-region refinement strategy that delivers both high accuracy and strong interpretability. On the RefCOCO benchmark suite it reaches 65.1% average zero-shot grounding accuracy and 89.7% accuracy at the selection stage, on par with supervised methods. The core contribution is a fully fine-tuning-free grounding approach that removes the dependence on labeled data and task-specific adaptation, substantially improving generalization to out-of-distribution scenarios.
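The pipeline described in the summary (an open-vocabulary detector proposes candidate boxes, an MLLM describes each candidate, and an LLM reasons over semantics and spatial layout to narrow the set) can be pictured roughly as follows. This is a minimal illustrative sketch, not the paper's released code; every name here (`Region`, `detect_regions`, `caption_region`, `shortlist_regions`) is a hypothetical stand-in for the corresponding component.

```python
# Minimal sketch (assumed interface, not the paper's actual API) of the
# detector -> MLLM -> LLM refinement loop described above.
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class Region:
    box: tuple                 # (x1, y1, x2, y2) candidate box from the detector
    caption: str = ""          # MLLM-generated description of the cropped region

def ground(
    image,
    query: str,
    detect_regions: Callable[[object, str], List[Region]],           # open-vocabulary detector
    caption_region: Callable[[object, Region], str],                  # MLLM region captioner
    shortlist_regions: Callable[[str, Sequence[Region]], List[int]],  # LLM semantic/spatial reasoner
    max_rounds: int = 3,
):
    """Iteratively refine candidate regions until one box remains (or rounds run out)."""
    candidates = detect_regions(image, query)                     # 1) propose candidate regions
    for _ in range(max_rounds):
        if len(candidates) <= 1:
            break
        for region in candidates:
            region.caption = caption_region(image, region)        # 2) describe each candidate
        keep = shortlist_regions(query, candidates)               # 3) LLM keeps plausible indices
        candidates = [candidates[i] for i in keep] or candidates  # never discard everything
    return candidates[0] if candidates else None
```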

πŸ“ Abstract
Visual grounding, the task of linking textual queries to specific regions within images, plays a pivotal role in vision-language integration. Existing methods typically rely on extensive task-specific annotations and fine-tuning, limiting their ability to generalize effectively to novel or out-of-distribution scenarios. To address these limitations, we introduce GroundingAgent, a novel agentic visual grounding framework that operates without any task-specific fine-tuning. GroundingAgent employs a structured, iterative reasoning mechanism that integrates pretrained open-vocabulary object detectors, multimodal large language models (MLLMs), and large language models (LLMs) to progressively refine candidate regions through joint semantic and spatial analyses. Remarkably, GroundingAgent achieves an average zero-shot grounding accuracy of 65.1 % on widely-used benchmarks (RefCOCO, RefCOCO+, RefCOCOg), entirely without fine-tuning. Furthermore, by substituting MLLM-generated captions with the original query texts, the accuracy at the selection stage alone reaches approximately 90 %, closely matching supervised performance and underscoring the critical role of LLM reasoning capabilities. GroundingAgent also offers strong interpretability, transparently illustrating each reasoning step and providing clear insights into its decision-making process.
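The roughly 90% selection-stage figure in the abstract isolates the final LLM decision: given the query and one textual description per candidate region (MLLM captions normally, the original query texts in the ablation), the LLM picks the best-matching candidate. The sketch below shows one way such a selection-stage evaluation could be wired up; the prompt format and the `ask_llm` callable are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical selection-stage evaluation: the LLM sees only text
# (the query plus one description per candidate) and returns an index.
from typing import Callable, List, Sequence, Tuple

def build_selection_prompt(query: str, descriptions: Sequence[str]) -> str:
    lines = [f"Query: {query}", "Candidate regions:"]
    lines += [f"[{i}] {d}" for i, d in enumerate(descriptions)]
    lines.append("Reply with only the index of the candidate that best matches the query.")
    return "\n".join(lines)

def selection_accuracy(
    samples: List[Tuple[str, Sequence[str], int]],  # (query, candidate descriptions, gold index)
    ask_llm: Callable[[str], str],                  # any text-in/text-out LLM interface
) -> float:
    correct = 0
    for query, descriptions, gold_index in samples:
        reply = ask_llm(build_selection_prompt(query, descriptions))
        digits = "".join(ch for ch in reply if ch.isdigit())
        correct += bool(digits) and int(digits) == gold_index
    return correct / max(len(samples), 1)
```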
Problem

Research questions and friction points this paper is trying to address.

Developing training-free visual grounding without task-specific fine-tuning
Enhancing generalization to novel scenarios through agentic reasoning
Integrating pretrained detectors and language models for spatial-semantic analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free agentic visual grounding framework
Iterative reasoning with pretrained detectors and models
Zero-shot accuracy matching supervised performance
Liqin Luo
National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University
Guangyao Chen
Cornell University
Open-world Learning, Autonomous Agent, AI for Science
Xiawu Zheng
Associate Professor, IEEE Senior Member, Xiamen University
Automated Machine Learning, Network Compression, Neural Architecture Search, AutoML
Yongxing Dai
National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University
Yixiong Zou
Huazhong University of Science and Technology
Computer vision, Domain generalization, Few-shot learning, Vision-language model
Yonghong Tian
National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University