🤖 AI Summary
This paper systematically reviews bottlenecks in deploying foundation models (LLMs/VLMs) for robotics, identifying five core challenges: insufficient real-time responsiveness, weak perception–action coupling, poor cross-domain generalization, limited robustness, and deficient human–robot trust. To address these, we propose a system-level embodied intelligence framework integrating procedural scene generation, multimodal reasoning, policy generalization, and sim-to-real co-training—thereby enforcing closed-loop alignment between semantic understanding and physical execution. We introduce the first end-to-end evaluation taxonomy spanning perception, planning, control, and interaction, and identify three critical gaps: embodied representation modeling, scarcity of high-fidelity multimodal robotic data, and rigorous safety verification. Based on this analysis, we chart a pragmatic research roadmap. Our work provides both theoretical foundations and engineering blueprints to advance foundation models from “language intelligence” toward “physical intelligence.”
📝 Abstract
The rapid emergence of foundation models, particularly Large Language Models (LLMs) and Vision-Language Models (VLMs), has introduced a transformative paradigm in robotics. These models offer powerful capabilities in semantic understanding, high-level reasoning, and cross-modal generalization, enabling significant advances in perception, planning, control, and human–robot interaction. This critical review provides a structured synthesis of recent developments, categorizing applications across simulation-driven design, open-world execution, sim-to-real transfer, and adaptable robotics. Unlike existing surveys that emphasize isolated capabilities, this work highlights integrated, system-level strategies and evaluates their practical feasibility in real-world environments. Key enabling trends such as procedural scene generation, policy generalization, and multimodal reasoning are discussed alongside core bottlenecks, including limited embodiment, lack of multimodal data, safety risks, and computational constraints. Through this lens, the paper identifies both the architectural strengths and critical limitations of foundation model-based robotics, surfacing open challenges in real-time operation, grounding, resilience, and trust. The review concludes with a roadmap for future research aimed at bridging semantic reasoning and physical intelligence through more robust, interpretable, and embodied models.