Bridging Bots: from Perception to Action via Multimodal-LMs and Knowledge Graphs

📅 2025-07-13
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Service robots face challenges including proprietary system lock-in, poor cross-platform adaptability, and fragmented perception-decision-execution pipelines. This paper proposes a neuro-symbolic integration framework that combines the raw multimodal perception capabilities of large language and vision models (e.g., GPT-o1, LLaMA 4 Maverick) with the structured reasoning of knowledge graphs. Leveraging ontology-guided neuro-symbolic interaction, the framework automatically generates semantically compliant, standardized knowledge graphs from visual and linguistic inputs, enabling transparent, verifiable decision-making and cross-platform interoperability. Experimental results demonstrate that performance hinges not on model parameter count or recency, but critically on the design of neural-symbolic coordination mechanisms. The generated knowledge graphs significantly outperform baselines in logical consistency and ontology compliance, thereby enhancing adaptive behavior and explainability of service robots in domestic environments.

๐Ÿ“ Abstract
Personal service robots are deployed to support daily living in domestic environments, particularly for the elderly and for individuals requiring assistance. These robots must perceive complex and dynamic surroundings, understand tasks, and execute context-appropriate actions. However, current systems rely on proprietary, hard-coded solutions tied to specific hardware and software, resulting in siloed implementations that are difficult to adapt and scale across platforms. Ontologies and Knowledge Graphs (KGs) offer a solution to enable interoperability across systems, through structured and standardized representations of knowledge and reasoning. However, symbolic systems such as KGs and ontologies struggle with raw and noisy sensory input. In contrast, multimodal language models are well suited for interpreting input such as images and natural language, but often lack transparency, consistency, and knowledge grounding. In this work, we propose a neurosymbolic framework that combines the perceptual strengths of multimodal language models with the structured representations provided by KGs and ontologies, with the aim of supporting interoperability in robotic applications. Our approach generates ontology-compliant KGs that can inform robot behavior in a platform-independent manner. We evaluate this framework by integrating robot perception data, ontologies, and five multimodal models (three LLaMA and two GPT models), using different modes of neural-symbolic interaction. We assess the consistency and effectiveness of the generated KGs across multiple runs and configurations, and perform statistical analyses to evaluate performance. Results show that GPT-o1 and LLaMA 4 Maverick consistently outperform other models. However, our findings also indicate that newer models do not guarantee better results, highlighting the critical role of the integration strategy in generating ontology-compliant KGs.
Problem

Research questions and friction points this paper is trying to address.

Integrate multimodal-LMs with KGs for robot perception and action.
Overcome proprietary, hard-coded limitations in current robotic systems.
Ensure interoperability and scalability in robotic applications via neurosymbolic frameworks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines multimodal-LMs with knowledge graphs
Generates ontology-compliant knowledge graphs
Integrates perception data and multiple models
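The core idea of "ontology-compliant KG generation" can be sketched as a filter step between the language model and the robot's knowledge base: candidate triples emitted by the model are admitted only if they respect the ontology's domain and range constraints. The sketch below is a minimal, hypothetical illustration with a hand-coded toy schema, not the paper's actual ontology or pipeline.

```python
# Hypothetical schema for illustration: property -> (subject class, object class),
# mimicking rdfs:domain / rdfs:range constraints from an ontology.
ONTOLOGY = {
    "locatedIn": ("PhysicalObject", "Room"),
    "holds":     ("Robot", "PhysicalObject"),
}

# Instance typing, as the perception pipeline might provide it.
INSTANCES = {
    "cup1": "PhysicalObject",
    "kitchen": "Room",
    "robot1": "Robot",
}

def validate(triple):
    """Return None if the triple is ontology-compliant, else a reason string."""
    s, p, o = triple
    if p not in ONTOLOGY:
        return f"unknown property: {p}"
    dom, rng = ONTOLOGY[p]
    if INSTANCES.get(s) != dom:
        return f"domain violation: {s} is not a {dom}"
    if INSTANCES.get(o) != rng:
        return f"range violation: {o} is not a {rng}"
    return None

# Candidate triples as a multimodal LM might emit them from a scene description.
candidates = [
    ("cup1", "locatedIn", "kitchen"),   # compliant
    ("kitchen", "holds", "cup1"),       # domain violation: a Room cannot hold
]

# Only compliant triples enter the knowledge graph.
kg = [t for t in candidates if validate(t) is None]
print(kg)
```

In a real system this role would typically be played by an RDF/OWL toolchain (e.g., SHACL validation over generated triples); the point here is only that symbolic constraints make the model's output verifiable before it informs robot behavior.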