🤖 AI Summary
This work addresses two limitations in zero-shot referring expression comprehension (REC): existing vision-language models struggle to capture fine-grained visual details and complex object relationships, while large language models (LLMs) cannot process images directly. To bridge this gap, the authors propose SGREC, a novel approach that introduces a query-driven scene graph as a structured intermediary. This graph integrates CLIP-derived region features, spatial relations, descriptive captions, and object interaction cues to align visual content with natural language queries and guide the LLM toward interpretable, structured reasoning. SGREC achieves state-of-the-art performance on multiple zero-shot REC benchmarks, reporting 66.78% on RefCOCO val, 53.43% on RefCOCO+ testB, and 73.28% on RefCOCOg val, demonstrating both high accuracy and strong interpretability in its decision-making process.
📝 Abstract
Zero-shot referring expression comprehension (REC) aims to locate target objects in images given natural language queries without relying on task-specific training data, demanding strong visual understanding capabilities. Existing Vision-Language Models~(VLMs), such as CLIP, commonly address zero-shot REC by directly measuring feature similarities between textual queries and image regions. However, these methods struggle to capture fine-grained visual details and understand complex object relationships. Meanwhile, although Large Language Models~(LLMs) excel at high-level semantic reasoning, their inability to directly abstract visual features into textual semantics limits their application in REC tasks. To overcome these limitations, we propose \textbf{SGREC}, an interpretable zero-shot REC method leveraging query-driven scene graphs as structured intermediaries. Specifically, we first employ a VLM to construct a query-driven scene graph that explicitly encodes spatial relationships, descriptive captions, and object interactions relevant to the given query. This scene graph bridges the gap between low-level image regions and the higher-level semantic understanding required by LLMs. Finally, an LLM infers the target object from the structured textual representation provided by the scene graph, accompanying its decisions with detailed explanations that ensure interpretability in the inference process. Extensive experiments show that SGREC achieves top-1 accuracy on most zero-shot REC benchmarks, including RefCOCO val (66.78\%), RefCOCO+ testB (53.43\%), and RefCOCOg val (73.28\%), highlighting its strong visual scene understanding.
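To make the role of the scene graph as a structured intermediary concrete, the sketch below shows one plausible shape for a query-driven scene graph and its serialization into text an LLM can reason over. All class names, the graph schema, and the prompt format are illustrative assumptions for this sketch, not the authors' actual implementation; in SGREC, the captions, CLIP similarities, and relations would come from a VLM rather than being hard-coded.

```python
# A minimal sketch (assumed schema, not SGREC's real code) of a query-driven
# scene graph: region nodes carry descriptive captions and CLIP-style query
# similarities, edges carry spatial/interaction relations, and to_prompt()
# serializes everything into structured text for an LLM to reason over.
from dataclasses import dataclass, field


@dataclass
class RegionNode:
    region_id: int
    caption: str          # descriptive caption of the region (e.g. from a VLM)
    clip_score: float     # similarity between the query and the region's CLIP feature


@dataclass
class SceneGraph:
    query: str
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (subject_id, relation, object_id)

    def to_prompt(self) -> str:
        """Serialize the graph into a structured textual prompt."""
        lines = [f"Query: {self.query}", "Objects:"]
        for n in self.nodes:
            lines.append(
                f"  [{n.region_id}] {n.caption} (query similarity {n.clip_score:.2f})"
            )
        lines.append("Relations:")
        for subj, rel, obj in self.edges:
            lines.append(f"  [{subj}] {rel} [{obj}]")
        lines.append(
            "Which object id best matches the query? Explain your reasoning."
        )
        return "\n".join(lines)


# Toy example: two regions and one spatial relation (values are made up).
g = SceneGraph(query="the man left of the dog")
g.nodes.append(RegionNode(0, "a man in a red shirt", 0.31))
g.nodes.append(RegionNode(1, "a brown dog", 0.22))
g.edges.append((0, "left of", 1))
print(g.to_prompt())
```

Serializing the graph this way is what lets a text-only LLM pick a region id and justify its choice, which is the source of the interpretability the abstract describes.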