Foundation Model Driven Robotics: A Comprehensive Review

📅 2025-07-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper systematically reviews bottlenecks in deploying foundation models (LLMs/VLMs) for robotics, identifying five core challenges: insufficient real-time responsiveness, weak perception–action coupling, poor cross-domain generalization, limited robustness, and deficient human–robot trust. To address these, we propose a system-level embodied intelligence framework integrating procedural scene generation, multimodal reasoning, policy generalization, and sim-to-real co-training—thereby enforcing closed-loop alignment between semantic understanding and physical execution. We introduce the first end-to-end evaluation taxonomy spanning perception, planning, control, and interaction, and identify three critical gaps: embodied representation modeling, scarcity of high-fidelity multimodal robotic data, and rigorous safety verification. Based on this analysis, we chart a pragmatic research roadmap. Our work provides both theoretical foundations and engineering blueprints to advance foundation models from “language intelligence” toward “physical intelligence.”
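To make the closed-loop alignment idea concrete, here is a minimal Python sketch of a perceive–plan–act–verify cycle. The Observation, Planner, and Controller interfaces are hypothetical stand-ins invented for illustration; they are not APIs or components from the paper.

```python
# Minimal sketch (not the paper's implementation) of closed-loop alignment:
# a foundation-model planner proposes a subgoal from an observation, a
# low-level controller executes it, and the outcome feeds the next step.

from dataclasses import dataclass, field


@dataclass
class Observation:
    """Multimodal snapshot: an image stub plus detected object labels."""
    image: bytes
    objects: list[str] = field(default_factory=list)


class Planner:
    """Stand-in for an LLM/VLM mapping goals + observations to subgoals."""

    def propose(self, goal: str, obs: Observation, feedback: str) -> str:
        # A real system would prompt a foundation model here.
        if not obs.objects:
            return "done"
        verb = "retry grasp" if feedback == "failed" else "grasp"
        return f"{verb} {obs.objects[0]}"


class Controller:
    """Stand-in for a low-level skill library (grasping, placing, ...)."""

    def execute(self, subgoal: str) -> str:
        return "succeeded"  # a real skill would report success or failure


def run_episode(goal: str, max_steps: int = 3) -> None:
    planner, controller = Planner(), Controller()
    feedback = "none"
    for step in range(max_steps):
        obs = Observation(image=b"", objects=["red_cube"])  # fake perception
        subgoal = planner.propose(goal, obs, feedback)
        if subgoal == "done":
            break
        feedback = controller.execute(subgoal)  # outcome closes the loop
        print(f"step {step}: {subgoal} -> {feedback}")


run_episode("stack the red cube on the blue cube")
```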

📝 Abstract
The rapid emergence of foundation models, particularly Large Language Models (LLMs) and Vision-Language Models (VLMs), has introduced a transformative paradigm in robotics. These models offer powerful capabilities in semantic understanding, high-level reasoning, and cross-modal generalization, enabling significant advances in perception, planning, control, and human-robot interaction. This critical review provides a structured synthesis of recent developments, categorizing applications across simulation-driven design, open-world execution, sim-to-real transfer, and adaptable robotics. Unlike existing surveys that emphasize isolated capabilities, this work highlights integrated, system-level strategies and evaluates their practical feasibility in real-world environments. Key enabling trends such as procedural scene generation, policy generalization, and multimodal reasoning are discussed alongside core bottlenecks, including limited embodiment, lack of multimodal data, safety risks, and computational constraints. Through this lens, this paper identifies both the architectural strengths and critical limitations of foundation model-based robotics, highlighting open challenges in real-time operation, grounding, resilience, and trust. The review concludes with a roadmap for future research aimed at bridging semantic reasoning and physical intelligence through more robust, interpretable, and embodied models.
Problem

Research questions and friction points this paper is trying to address.

How can foundation models enhance robotic perception and planning?
What obstacles limit real-world deployment of multimodal robotic systems?
How can semantic reasoning be bridged with physical robot intelligence?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Foundation models enhance semantic understanding and reasoning
Integrated system-level strategies for real-world robotics
Procedural scene generation and multimodal reasoning techniques (see the sketch after this list)
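As referenced above, here is a minimal, hypothetical sketch of procedural scene generation: randomizing object types, poses, and lighting to emit scene configurations a simulator could consume for sim-to-real co-training. The schema and names are illustrative assumptions, not the paper's pipeline.

```python
# Illustrative-only procedural scene generation: sample randomized scene
# configs a downstream simulator could load. The schema is invented here.

import json
import random

OBJECT_TYPES = ["cube", "mug", "bowl", "screwdriver"]


def sample_scene(num_objects: int = 3, seed: int | None = None) -> dict:
    rng = random.Random(seed)  # seeded for reproducible scene batches
    return {
        "lighting": {"intensity": rng.uniform(0.4, 1.0)},
        "objects": [
            {
                "type": rng.choice(OBJECT_TYPES),
                "position_xy": [rng.uniform(-0.3, 0.3), rng.uniform(-0.3, 0.3)],
                "yaw_rad": rng.uniform(0.0, 6.283),
            }
            for _ in range(num_objects)
        ],
    }


# Generate a small batch of randomized scenes and inspect one of them.
scenes = [sample_scene(seed=i) for i in range(4)]
print(json.dumps(scenes[0], indent=2))
```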
Muhammad Tayyab Khan
Nanyang Technological University, Singapore; Singapore Institute of Manufacturing Technology, A*STAR
Smart Manufacturing · Multi-Agent Systems · LLMs · Knowledge Graphs · AI
Ammar Waheed
J. Mike Walker ’66 Department of Mechanical Engineering, Texas A&M University, College Station, TX 77801, USA