Embodied Robot Manipulation in the Era of Foundation Models: Planning and Learning Perspectives

📅 2025-12-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Embodied robotic manipulation faces three main challenges in the era of large models: a disconnect between high-level planning and low-level control; difficulty in jointly reasoning over multimodal information (language, code, motion, affordances, and 3D geometry); and limitations in scalability, data efficiency, physical interaction fidelity, and safety. To address these, this survey organizes recent learning-based approaches within a unified abstraction of high-level planning and low-level control. First, it generalizes task planning to multimodal, structured, long-horizon decision-making that integrates vision-language-action foundation models, neuro-symbolic planning, 3D scene understanding, and affordance reasoning. Second, it proposes a training-paradigm-oriented taxonomy of low-level control organized along three axes: input modeling, latent representation learning, and policy learning. Together, these analyses delineate the design space of foundation-model-driven manipulation.

📝 Abstract

Recent advances in vision, language, and multimodal learning have substantially accelerated progress in robotic foundation models, with robot manipulation remaining a central and challenging problem. This survey examines robot manipulation from an algorithmic perspective and organizes recent learning-based approaches within a unified abstraction of high-level planning and low-level control. At the high level, we extend the classical notion of task planning to include reasoning over language, code, motion, affordances, and 3D representations, emphasizing their role in structured and long-horizon decision making. At the low level, we propose a training-paradigm-oriented taxonomy for learning-based control, organizing existing methods along input modeling, latent representation learning, and policy learning. Finally, we identify open challenges and prospective research directions related to scalability, data efficiency, multimodal physical interaction, and safety. Together, these analyses aim to clarify the design space of modern foundation models for robotic manipulation.

Problem

Research questions and friction points this paper is trying to address.

Surveying robot manipulation challenges in the foundation model era

Organizing learning approaches via high-level planning and low-level control

Identifying open challenges in scalability, data efficiency, and safety

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified planning abstraction combining language, code, and 3D reasoning

Training-paradigm taxonomy for learning-based control methods

Algorithmic perspective organizing high-level planning and low-level control

🔎 Similar Papers

What Foundation Models can Bring for Robot Learning in Manipulation : A Survey