🤖 AI Summary
Embodied robotic manipulation faces critical challenges in the era of large models: a disconnect between high-level planning and low-level control; difficulty in jointly reasoning over multimodal information (language, code, motion, affordances, and 3D geometry); and limitations in scalability, data efficiency, physical interaction fidelity, and safety. To address these, this survey organizes recent learning-based approaches within a unified "planning and control" abstraction. First, it extends the classical notion of task planning to multimodal, structured, long-horizon decision making, integrating vision-language-action foundation models, neuro-symbolic planning, 3D scene understanding, and affordance reasoning. Second, it proposes a training-paradigm-oriented taxonomy of low-level control, organizing existing methods along three axes: input modeling, latent representation learning, and policy learning. Together, these analyses delineate the design space for large-model-driven manipulation and identify open challenges and research directions for next-generation embodied intelligence.
📝 Abstract
Recent advances in vision, language, and multimodal learning have substantially accelerated progress in robotic foundation models, with robot manipulation remaining a central and challenging problem. This survey examines robot manipulation from an algorithmic perspective and organizes recent learning-based approaches within a unified abstraction of high-level planning and low-level control. At the high level, we extend the classical notion of task planning to include reasoning over language, code, motion, affordances, and 3D representations, emphasizing their role in structured and long-horizon decision making. At the low level, we propose a training-paradigm-oriented taxonomy for learning-based control, organizing existing methods along input modeling, latent representation learning, and policy learning. Finally, we identify open challenges and prospective research directions related to scalability, data efficiency, multimodal physical interaction, and safety. Together, these analyses aim to clarify the design space of modern foundation models for robotic manipulation.