NAVER: A Neuro-Symbolic Compositional Automaton for Visual Grounding with Explicit Logic Reasoning

📅 2025-02-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods handle linguistic logic such as negation, quantification, and compositional constraints poorly in complex referring expression detection and segmentation. This paper proposes NAVER, a neuro-symbolic compositional framework for these tasks. Its symbolic reasoning core embeds probabilistic logic inference in a finite-state automaton (FSA), which makes inference more interpretable and robust. An LLM and a VLM work together, with language understanding explicitly separated from visual grounding, and a self-correcting feedback loop verifies and refines the reasoning as it runs. Experiments on RefCOCO, RefCOCO+, and RefCOCOg show improvements over both end-to-end models and recent compositional approaches, reaching state-of-the-art results. The source code is publicly available.
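To make the idea of probabilistic logic over a referring expression concrete, here is a minimal sketch for a query like "the dog that is not on the sofa". It is not NAVER's inference procedure: the candidate boxes, the predicate probabilities, and the helper names (`p_not`, `p_and`, `p_exists`) are all hypothetical, and the predicates are assumed to be independent so that conjunction becomes a product.

```python
from math import prod

# Hypothetical helpers, assuming independent predicates (an illustration only).
def p_not(p):
    """Probability that a predicate does NOT hold."""
    return 1.0 - p

def p_and(*ps):
    """Probability that all predicates hold, under independence."""
    return prod(ps)

def p_exists(ps):
    """Probability that at least one candidate satisfies the query (noisy-or)."""
    return 1.0 - prod(1.0 - p for p in ps)

# Made-up per-box predicate probabilities, as a vision model might output them.
candidates = {
    "box_a": {"dog": 0.9, "on_sofa": 0.8},
    "box_b": {"dog": 0.8, "on_sofa": 0.1},
    "box_c": {"dog": 0.1, "on_sofa": 0.05},
}

def score(c):
    # Query: dog(x) AND NOT on_sofa(x)
    return p_and(c["dog"], p_not(c["on_sofa"]))

scores = {name: score(c) for name, c in candidates.items()}
best = max(scores, key=scores.get)      # box with the highest query probability
any_match = p_exists(scores.values())   # chance that some box matches at all
print(best, round(scores[best], 3), round(any_match, 3))
```

Because each intermediate probability is explicit, a low `any_match` value can signal that the expression probably refers to nothing in the image, which is the kind of check an end-to-end model does not expose.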

📝 Abstract
Visual Grounding (VG) tasks, such as referring expression detection and segmentation, are important for linking visual entities to context, especially in complex reasoning tasks that require detailed query interpretation. This paper explores VG beyond basic perception, highlighting the challenges faced by methods that must reason in a human-like way. Recent advances in large language models (LLMs) and vision-language models (VLMs) have improved visual comprehension, contextual understanding, and reasoning. Methods built on them fall mainly into end-to-end and compositional approaches, with the latter offering more flexibility. Compositional approaches that integrate LLMs and foundation models show promising performance but still struggle with complex reasoning over language-based logical representations. To address these limitations, we propose NAVER, a compositional visual grounding method that integrates explicit probabilistic logic reasoning within a finite-state automaton, equipped with a self-correcting mechanism. This design improves robustness and interpretability in inference through explicit logic reasoning. Our results show that NAVER achieves state-of-the-art (SoTA) performance compared to recent end-to-end and compositional baselines. The code is available at https://github.com/ControlNet/NAVER.
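
The control flow described in the abstract, a finite-state automaton with a self-correcting mechanism, can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the NAVER repository: the state names, the `run_automaton` function, the fall-back-one-state retry policy, and the stub step functions are all hypothetical.

```python
from enum import Enum, auto

class State(Enum):
    # Hypothetical state names; the paper's actual states may differ.
    PERCEIVE = auto()
    BUILD_LOGIC = auto()
    REASON = auto()
    ANSWER = auto()
    FAIL = auto()

def run_automaton(steps, max_retries=2):
    """Run the working states in order; each step returns True if its check passes.

    On a failed check the automaton falls back one state and retries
    (the assumed self-correction policy) until max_retries is used up.
    """
    order = [State.PERCEIVE, State.BUILD_LOGIC, State.REASON]
    context, trace = {}, []
    index, retries = 0, 0
    while index < len(order):
        state = order[index]
        trace.append(state)
        if steps[state](context):
            index += 1                  # check passed: advance
        elif retries < max_retries:
            retries += 1
            index = max(index - 1, 0)   # check failed: redo the previous state
        else:
            trace.append(State.FAIL)
            return State.FAIL, trace
    trace.append(State.ANSWER)
    return State.ANSWER, trace

# Stub steps standing in for the VLM, the LLM, and the logic engine.
def perceive(ctx):
    ctx["boxes"] = ["box_a", "box_b"]
    return True

def build_logic(ctx):
    ctx["attempt"] = ctx.get("attempt", 0) + 1
    return True

def reason(ctx):
    # Pretend the first generated logic program is rejected, the second accepted.
    return ctx["attempt"] >= 2

final, trace = run_automaton(
    {State.PERCEIVE: perceive, State.BUILD_LOGIC: build_logic, State.REASON: reason}
)
print(final.name, [s.name for s in trace])
```

Keeping the transitions in one small loop means every run leaves a trace of visited states, which is one way the explicit automaton can support the interpretability the abstract claims.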

Problem

Research questions and friction points this paper is trying to address.

Complex Language Logic Reasoning
Visual Localization Tasks
Model Limitations in Logic

Innovation

Methods, ideas, or system contributions that make the work stand out.

NAVER
Combination Method
Advanced Visual Localization