🤖 AI Summary
Current vision-language models (VLMs) rely on implicit correlation modeling and single-modality (e.g., RGB) image inputs, limiting their performance on fine-grained spatial reasoning tasks such as robotic visual localization. To address this, we propose a multimodal neuro-symbolic framework that jointly processes panoramic images and 3D point clouds. It integrates neural perception (entity detection and attribute extraction) with symbolic reasoning to explicitly construct structured scene graphs that encode precise spatial and logical relationships among objects, enabling interpretable, query-driven joint inference. Evaluated on the JRDB-Reasoning benchmark, our approach achieves state-of-the-art localization accuracy and robustness in cluttered, human-made environments. Moreover, the framework is lightweight, making it suitable for resource-constrained robotic and embodied AI systems.
📝 Abstract
Visual reasoning, particularly spatial reasoning, is a challenging cognitive task that requires understanding object relationships and their interactions within complex environments, especially in the robotics domain. Existing vision-language models (VLMs) excel at perception tasks but struggle with fine-grained spatial reasoning due to their implicit, correlation-driven reasoning and reliance solely on images. We propose a novel neuro-symbolic framework that integrates both panoramic-image and 3D point cloud information, combining neural perception with symbolic reasoning to explicitly model spatial and logical relationships. Our framework consists of a perception module for detecting entities and extracting attributes, and a reasoning module that constructs a structured scene graph to support precise, interpretable queries. Evaluated on the JRDB-Reasoning dataset, our approach demonstrates superior performance and reliability in crowded, human-built environments while maintaining a lightweight design suitable for robotics and embodied AI applications.
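To make the scene-graph idea concrete, here is a minimal illustrative sketch, not the paper's actual implementation: the `Entity`, `SceneGraph`, `add_spatial_relations`, and `query` names, the `near`/`left_of` predicates, and the threshold values are all hypothetical, and the geometry is deliberately simplified. It only shows how explicit relations derived from 3D detections can support an interpretable, query-driven lookup.

```python
from dataclasses import dataclass, field
from itertools import combinations
import math

# Hypothetical output of the perception module: a detected object with a class
# label, extracted attributes, and a 3D centroid taken from the point cloud.
@dataclass
class Entity:
    id: int
    label: str
    attributes: dict
    centroid: tuple  # (x, y, z) in the robot's frame

@dataclass
class SceneGraph:
    entities: list = field(default_factory=list)
    relations: list = field(default_factory=list)  # (subject_id, predicate, object_id)

    def add_spatial_relations(self, near_threshold=1.5):
        """Derive explicit pairwise relations from 3D geometry (illustrative predicates)."""
        for a, b in combinations(self.entities, 2):
            dist = math.dist(a.centroid, b.centroid)
            if dist < near_threshold:
                self.relations.append((a.id, "near", b.id))
                self.relations.append((b.id, "near", a.id))
            # Crude egocentric left/right predicate based on the x-axis only.
            if b.centroid[0] - a.centroid[0] > 0:
                self.relations.append((a.id, "left_of", b.id))
            else:
                self.relations.append((b.id, "left_of", a.id))

    def query(self, label, predicate, anchor_label):
        """Return entities of `label` standing in `predicate` to some `anchor_label` entity."""
        anchors = {e.id for e in self.entities if e.label == anchor_label}
        hits = {s for (s, p, o) in self.relations if p == predicate and o in anchors}
        return [e for e in self.entities if e.id in hits and e.label == label]

# Usage: localize "the chair near the table" from two mock detections.
graph = SceneGraph(entities=[
    Entity(0, "chair", {"color": "red"}, (0.5, 0.0, 0.0)),
    Entity(1, "table", {"material": "wood"}, (1.2, 0.1, 0.0)),
])
graph.add_spatial_relations()
print(graph.query("chair", "near", "table"))  # -> [Entity(id=0, label='chair', ...)]
```

Because every relation is stored explicitly, the answer to a spatial query can be traced back to the geometric predicates that produced it, which is the interpretability property the framework relies on.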