🤖 AI Summary
This survey addresses the semantic gap between natural language instructions and robotic physical actions, with the goal of making human-robot collaboration more natural and reliable. We organize language-conditioned robotic manipulation into four categories: language-conditioned reward shaping, language-conditioned policy learning, neuro-symbolic AI, and foundation model-driven approaches built on large language models (LLMs) and vision-language models (VLMs). Within this taxonomy, we analyze state-of-the-art techniques along four axes: semantic information extraction, environment and evaluation, auxiliary tasks, and task representation. A comparative analysis highlights the strengths and limitations of each category in linking language instructions to robot actions, and we close with open challenges and future directions centered on generalization and safety.
📝 Abstract
Language-conditioned robot manipulation is an emerging field that aims to enable seamless communication and cooperation between humans and robotic agents by teaching robots to comprehend and execute instructions conveyed in natural language. This interdisciplinary area integrates scene understanding, language processing, and policy learning to bridge the gap between human instructions and robotic actions. In this comprehensive survey, we systematically explore recent advancements in language-conditioned robot manipulation. We categorize existing methods into language-conditioned reward shaping, language-conditioned policy learning, neuro-symbolic artificial intelligence, and the use of foundation models (FMs) such as large language models (LLMs) and vision-language models (VLMs). Specifically, we analyze state-of-the-art techniques with respect to semantic information extraction, environment and evaluation, auxiliary tasks, and task representation strategies. Through a comparative analysis, we highlight the strengths and limitations of current approaches in bridging language instructions with robot actions. Finally, we discuss open challenges and future research directions, focusing on improving generalization and addressing safety issues in language-conditioned robot manipulation.