A Multi-Modal Neuro-Symbolic Approach for Spatial Reasoning-Based Visual Grounding in Robotics

📅 2025-10-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current vision-language models (VLMs) rely on implicit correlation modeling and single-modality (e.g., RGB) image inputs, limiting their performance on fine-grained spatial reasoning tasks such as robotic visual localization. To address this, we propose a multi-modal neuro-symbolic framework that jointly processes panoramic images and 3D point clouds. It integrates neural perception—encompassing entity detection and attribute extraction—with symbolic reasoning to explicitly construct structured scene graphs encoding precise spatial and logical relationships among objects. This enables interpretable, query-driven joint inference. Evaluated on the JRDB-Reasoning benchmark, our approach achieves state-of-the-art localization accuracy and robustness in cluttered, human-made environments. Moreover, the framework is lightweight, making it suitable for resource-constrained robotic and embodied AI systems.

📝 Abstract
Visual reasoning, particularly spatial reasoning, is a challenging cognitive task that requires understanding object relationships and their interactions within complex environments, especially in the robotics domain. Existing vision-language models (VLMs) excel at perception tasks but struggle with fine-grained spatial reasoning due to their implicit, correlation-driven reasoning and reliance solely on images. We propose a novel neuro-symbolic framework that integrates both panoramic-image and 3D point cloud information, combining neural perception with symbolic reasoning to explicitly model spatial and logical relationships. Our framework consists of a perception module for detecting entities and extracting attributes, and a reasoning module that constructs a structured scene graph to support precise, interpretable queries. Evaluated on the JRDB-Reasoning dataset, our approach demonstrates superior performance and reliability in crowded, human-built environments while maintaining a lightweight design suitable for robotics and embodied AI applications.
Problem

Research questions and friction points this paper is trying to address.

Enhancing spatial reasoning for robotics using multi-modal neuro-symbolic integration
Addressing limitations of vision-language models in fine-grained spatial understanding
Developing interpretable scene graphs for precise visual grounding in crowded environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Neuro-symbolic framework integrating panoramic and 3D data
Perception module extracts entity attributes and relationships
Symbolic reasoning constructs structured scene graphs
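The pipeline above — neural perception producing entities with attributes, followed by symbolic construction of a scene graph that supports interpretable queries — can be sketched in miniature. This is a hypothetical illustration, not the paper's implementation: the entity list stands in for detector output, and the relation names and the `near_thresh` distance threshold are illustrative assumptions.

```python
# Minimal sketch of a neuro-symbolic scene-graph pipeline.
# Perception output is hard-coded in place of a real detector;
# the symbolic stage derives explicit spatial relations from 3D centroids.
import math
from itertools import permutations

# Stand-in for perception-module output: id, class, attributes, 3D centroid.
entities = [
    {"id": 0, "cls": "person", "attrs": {"color": "red"},   "xyz": (0.0, 0.0, 0.0)},
    {"id": 1, "cls": "table",  "attrs": {"color": "brown"}, "xyz": (0.8, 0.1, 0.0)},
    {"id": 2, "cls": "person", "attrs": {"color": "blue"},  "xyz": (5.0, 2.0, 0.0)},
]

def build_scene_graph(entities, near_thresh=1.5):
    """Symbolic stage: emit (subject, relation, object) triples."""
    edges = []
    for a, b in permutations(entities, 2):
        if math.dist(a["xyz"], b["xyz"]) < near_thresh:
            edges.append((a["id"], "near", b["id"]))
        if a["xyz"][0] < b["xyz"][0]:  # illustrative frame convention
            edges.append((a["id"], "left_of", b["id"]))
    return edges

def query(entities, edges, subj_cls, rel, obj_cls):
    """Interpretable query: which <subj_cls> stands in <rel> to a <obj_cls>?"""
    by_id = {e["id"]: e for e in entities}
    return [s for s, r, o in edges
            if r == rel and by_id[s]["cls"] == subj_cls
                        and by_id[o]["cls"] == obj_cls]

edges = build_scene_graph(entities)
print(query(entities, edges, "person", "near", "table"))  # → [0]
```

Because relations are explicit triples rather than implicit correlations, every grounding answer can be traced back to the edges that produced it.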