🤖 AI Summary
Conversational embodied agents for real-world tasks face intertwined challenges in multimodal perception, long-horizon decision-making, and interpretable reasoning. To address these, we propose a neuro-symbolic fusion framework featuring: (i) the first LLM-driven symbolic representation acquisition method jointly modeled with visual-semantic mapping; and (ii) a modular symbolic reasoning mechanism guided by task-level and action-level commonsense knowledge, balancing generalizability, interpretability, and few-shot adaptability. The framework integrates prompt engineering, semantic map construction, a symbolic planning engine, and a neuro-symbolic collaborative reasoning architecture. On the TEACh benchmark, our approach achieves state-of-the-art performance across all three conversational embodied tasks. Notably, the success rate on unseen scenes in the EDH setting improves substantially, from 6.1% to 15.8%. Furthermore, the framework secured first place in the Alexa Prize SimBot Public Benchmark Challenge.
📝 Abstract
Building a conversational embodied agent to execute real-life tasks has been a long-standing yet quite challenging research goal, as it requires effective human-agent communication, multi-modal understanding, and long-range sequential decision making. Traditional symbolic methods have scaling and generalization issues, while end-to-end deep learning models suffer from data scarcity and high task complexity, and are often hard to explain. To combine the strengths of both worlds, we propose JARVIS, a neuro-symbolic commonsense reasoning framework for modular, generalizable, and interpretable conversational embodied agents. First, it acquires symbolic representations by prompting large language models (LLMs) for language understanding and sub-goal planning, and by constructing semantic maps from visual observations. Then the symbolic module reasons for sub-goal planning and action generation based on task- and action-level common sense. Extensive experiments on the TEACh dataset validate the efficacy and efficiency of our JARVIS framework, which achieves state-of-the-art (SOTA) results on all three dialog-based embodied tasks: Execution from Dialog History (EDH), Trajectory from Dialog (TfD), and Two-Agent Task Completion (TATC) (e.g., our method boosts the unseen Success Rate on EDH from 6.1% to 15.8%). Moreover, we systematically analyze the essential factors that affect task performance and demonstrate the superiority of our method in few-shot settings. Our JARVIS model ranks first in the Alexa Prize SimBot Public Benchmark Challenge.
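The two-stage pipeline the abstract describes, prompting an LLM to turn dialog into symbolic sub-goals and then expanding those sub-goals into primitive actions via action-level commonsense rules, can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the function names, the stubbed LLM parser, and the precondition table are all hypothetical assumptions.

```python
# Illustrative sketch of a neuro-symbolic pipeline in the spirit of JARVIS
# (hypothetical names, NOT the paper's real API):
#   stage 1: an LLM maps dialog to symbolic sub-goals (stubbed here);
#   stage 2: a symbolic module prepends commonsense preconditions
#            (e.g., you must GoTo an object before you can PickUp it).

def llm_parse_subgoals(dialog: str) -> list[str]:
    """Stand-in for a few-shot LLM prompt that emits symbolic sub-goals."""
    # A real system would send the dialog plus in-context examples to an LLM.
    if "coffee" in dialog.lower():
        return ["PickUp(Mug)", "Place(Mug, CoffeeMachine)", "ToggleOn(CoffeeMachine)"]
    return []

# Action-level commonsense: preconditions required before each action type.
COMMONSENSE_PRECONDITIONS = {
    "PickUp": lambda obj: [f"GoTo({obj})"],
    "Place": lambda obj, dst: [f"GoTo({dst})"],
    "ToggleOn": lambda obj: [f"GoTo({obj})"],
}

def expand_subgoal(subgoal: str) -> list[str]:
    """Symbolic reasoning step: insert commonsense preconditions."""
    name, arg_str = subgoal.split("(", 1)
    args = [a.strip() for a in arg_str.rstrip(")").split(",")]
    return COMMONSENSE_PRECONDITIONS[name](*args) + [subgoal]

def plan(dialog: str) -> list[str]:
    """Full pipeline: dialog -> sub-goals -> grounded action sequence."""
    actions: list[str] = []
    for subgoal in llm_parse_subgoals(dialog):
        actions.extend(expand_subgoal(subgoal))
    return actions
```

For example, `plan("Please make me a cup of coffee")` yields a sequence that navigates to the mug before picking it up, an action-level constraint the LLM output alone does not encode.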