VIRO: Robust and Efficient Neuro-Symbolic Reasoning with Verification for Referring Expression Comprehension

📅 2026-01-19
🤖 AI Summary
This work addresses the vulnerability of existing neuro-symbolic approaches to cascading failures in referring expression comprehension, particularly their poor robustness when no target object exists in the image. To mitigate error propagation, the authors decouple program generation from execution and introduce, for the first time in neuro-symbolic reasoning, lightweight operator-level verifiers that validate conditions such as object existence or spatial relationships at each reasoning step. By integrating large language models, vision-language models, and verifiable symbolic operators, the method achieves a balanced accuracy of 61.1% across both target-present and target-absent scenarios, with a program failure rate below 0.3%. The result is a more reliable and accurate system that generalizes better to real-world data, especially in challenging no-target cases.
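
The summary does not say how balanced accuracy is computed. One common reading, assumed here, is the unweighted mean of the accuracy on target-present samples and the accuracy on no-target samples. The sketch below uses a hypothetical function name and made-up outcome lists purely for illustration; it does not reproduce the paper's numbers.

```python
def balanced_accuracy(present_correct, absent_correct):
    """Unweighted mean of the two per-split accuracies.

    Assumed definition for illustration; the source does not confirm it.
    Each argument is a list of booleans, one per sample in that split.
    """
    if not present_correct or not absent_correct:
        raise ValueError("both splits need at least one sample")
    acc_present = sum(present_correct) / len(present_correct)
    acc_absent = sum(absent_correct) / len(absent_correct)
    return (acc_present + acc_absent) / 2


# Toy outcomes: 3 of 4 correct when a target exists, 1 of 2 correct when none does.
score = balanced_accuracy([True, True, True, False], [True, False])
```

Averaging the two splits means a system cannot score well simply by always predicting a box or always predicting "no target".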

📝 Abstract
Referring Expression Comprehension (REC) aims to localize the image region corresponding to a natural-language query. Recent neuro-symbolic REC approaches leverage large language models (LLMs) and vision-language models (VLMs) to perform compositional reasoning, decomposing queries into structured programs and executing them step-by-step. While such approaches achieve interpretable reasoning and strong zero-shot generalization, they assume that intermediate reasoning steps are accurate. However, this assumption causes cascading errors: false detections and invalid relations propagate through the reasoning chain, yielding high-confidence false positives even when no target is present in the image. To address this limitation, we introduce Verification-Integrated Reasoning Operators (VIRO), a neuro-symbolic framework that embeds lightweight operator-level verifiers within reasoning steps. Each operator executes and validates its output, such as object existence or spatial relationship, thereby allowing the system to robustly handle no-target cases when verification conditions are not met. Our framework achieves state-of-the-art performance, reaching 61.1% balanced accuracy across target-present and no-target settings, and demonstrates generalization to real-world egocentric data. Furthermore, VIRO shows superior computational efficiency in terms of throughput, high reliability with a program failure rate of less than 0.3%, and scalability by decoupling program generation from execution.
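
The idea of operators that both execute and verify can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Box` type, the `find` and `left_of` operators, the `VerificationFailed` exception, the 0.5 score threshold, and the hand-written program are all hypothetical, and the detector is replaced by fixed toy boxes. In the paper's setting the program would be generated by an LLM and the detections would come from a VLM.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Box:
    label: str
    x: float      # left edge
    y: float      # top edge
    w: float
    h: float
    score: float  # detector confidence


class VerificationFailed(Exception):
    """Raised when an operator's own output check does not hold."""


def find(detections: List[Box], label: str, min_score: float = 0.5) -> List[Box]:
    """Hypothetical FIND operator: select boxes, then verify existence."""
    hits = [b for b in detections if b.label == label and b.score >= min_score]
    if not hits:
        raise VerificationFailed(f"no confident '{label}' detection")
    return hits


def left_of(candidates: List[Box], anchors: List[Box]) -> List[Box]:
    """Hypothetical LEFT_OF operator: filter by relation, then verify it holds."""
    def centre_x(b: Box) -> float:
        return b.x + b.w / 2

    hits = [c for c in candidates
            if any(centre_x(c) < centre_x(a) for a in anchors)]
    if not hits:
        raise VerificationFailed("spatial relation not satisfied")
    return hits


def run_program(detections: List[Box]) -> Optional[Box]:
    """Hand-written program for 'the cup left of the laptop'.

    Returns None (no target) as soon as any operator fails verification,
    instead of passing a doubtful intermediate result down the chain.
    """
    try:
        cups = find(detections, "cup")
        laptops = find(detections, "laptop")
        result = left_of(cups, laptops)
    except VerificationFailed:
        return None
    return max(result, key=lambda b: b.score)


cup = Box("cup", 10, 40, 20, 20, 0.9)
laptop = Box("laptop", 100, 30, 80, 60, 0.8)
target = run_program([cup, laptop])  # both checks pass
no_target = run_program([cup])       # no laptop in the scene
```

Here `target` is the cup box and `no_target` is `None`: the missing laptop stops the program at the second step, so no box is reported for a query whose referent cannot be verified.
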
Problem

Research questions and friction points this paper is trying to address.

Referring Expression Comprehension
neuro-symbolic reasoning
cascading errors
false positives
no-target cases
Innovation

Methods, ideas, or system contributions that make the work stand out.

neuro-symbolic reasoning
verification
referring expression comprehension
zero-shot generalization
error propagation