🤖 AI Summary
This paper systematically reviews bottlenecks in deploying foundation models (LLMs/VLMs) for robotics, identifying five core challenges: insufficient real-time responsiveness, weak perception–action coupling, poor cross-domain generalization, limited robustness, and deficient human–robot trust. To address these, we propose a system-level embodied intelligence framework integrating procedural scene generation, multimodal reasoning, policy generalization, and sim-to-real co-training—thereby enforcing closed-loop alignment between semantic understanding and physical execution. We introduce the first end-to-end evaluation taxonomy spanning perception, planning, control, and interaction, and identify three critical gaps: embodied representation modeling, scarcity of high-fidelity multimodal robotic data, and rigorous safety verification. Based on this analysis, we chart a pragmatic research roadmap. Our work provides both theoretical foundations and engineering blueprints to advance foundation models from “language intelligence” toward “physical intelligence.”
📝 Abstract
The rapid emergence of foundation models, particularly Large Language Models (LLMs) and Vision-Language Models (VLMs), has introduced a transformative paradigm in robotics. These models offer powerful capabilities in semantic understanding, high-level reasoning, and cross-modal generalization, enabling significant advances in perception, planning, control, and human–robot interaction. This critical review provides a structured synthesis of recent developments, categorizing applications across simulation-driven design, open-world execution, sim-to-real transfer, and adaptable robotics. Unlike existing surveys that emphasize isolated capabilities, this work highlights integrated, system-level strategies and evaluates their practical feasibility in real-world environments. Key enabling trends such as procedural scene generation, policy generalization, and multimodal reasoning are discussed alongside core bottlenecks, including limited embodiment, lack of multimodal data, safety risks, and computational constraints. Through this lens, the paper identifies both the architectural strengths and critical limitations of foundation model-based robotics, surfacing open challenges in real-time operation, grounding, resilience, and trust. The review concludes with a roadmap for future research aimed at bridging semantic reasoning and physical intelligence through more robust, interpretable, and embodied models.