Towards a Unified Understanding of Robot Manipulation: A Comprehensive Survey

📅 2025-10-12
🤖 AI Summary
Robot manipulation in unstructured environments requires tight integration of perception, planning, and control, yet the field has long lacked a unified framework and systematic benchmarks. This paper presents the first comprehensive, full-stack survey of robot manipulation. It proposes a taxonomy that unifies high-level multimodal planning (spanning language, code, motion, affordance, and 3D representations) with low-level learning-based control grounded in input modeling, latent representation learning, and policy learning. The survey systematically categorizes three fundamental bottlenecks: data acquisition, data utilization, and generalization. It further situates manipulation within advances in vision, language, and large-scale multimodal models, advocating a learning-driven control paradigm. An accompanying open-source GitHub repository provides task-oriented benchmarks, curated datasets, and structured methodological guidelines, serving as an accessible entry point for newcomers and a scalable reference framework for experienced researchers, thereby fostering coordinated advancement in the field.

📝 Abstract
Embodied intelligence has witnessed remarkable progress in recent years, driven by advances in computer vision, natural language processing, and the rise of large-scale multimodal models. Among its core challenges, robot manipulation stands out as a fundamental yet intricate problem, requiring the seamless integration of perception, planning, and control to enable interaction within diverse and unstructured environments. This survey presents a comprehensive overview of robotic manipulation, encompassing foundational background, task-organized benchmarks and datasets, and a unified taxonomy of existing methods. We extend the classical division between high-level planning and low-level control by broadening high-level planning to include language, code, motion, affordance, and 3D representations, while introducing a new taxonomy of low-level learning-based control grounded in training paradigms such as input modeling, latent learning, and policy learning. Furthermore, we provide the first dedicated taxonomy of key bottlenecks, focusing on data collection, utilization, and generalization, and conclude with an extensive review of real-world applications. Compared with prior surveys, our work offers both a broader scope and deeper insight, serving as an accessible roadmap for newcomers and a structured reference for experienced researchers. All related resources, including research papers, open-source datasets, and projects, are curated for the community at https://github.com/BaiShuanghao/Awesome-Robotics-Manipulation.
Problem

Research questions and friction points this paper is trying to address.

Surveying the integration of perception, planning, and control in robot manipulation
Broadening high-level planning to include language, code, motion, affordance, and 3D representations
Addressing bottlenecks in data collection, data utilization, and generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Broadens high-level planning to multimodal representations (language, code, motion, affordance, 3D)
Introduces a taxonomy of learning-based control organized by training paradigm
Provides the first dedicated taxonomy of data bottlenecks