OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints

📅 2025-01-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Robots struggle to generalize fine-grained 3D manipulation in unstructured environments and rely heavily on large-scale annotated datasets. Method: We propose an object-centric interaction-primitive representation grounded in functional intrinsic spaces (e.g., keypoints, principal axes) to define interpretable 3D spatial constraints, bridging the high-level semantic reasoning of vision-language models (VLMs) with low-level precise execution. We introduce the first open-vocabulary, dual-closed-loop manipulation framework that achieves zero-shot cross-task generalization without VLM fine-tuning; it integrates 6D pose tracking, interactive rendering, primitive resampling, and spatial-constraint mapping into an end-to-end closed feedback system. Contribution/Results: Our approach demonstrates strong zero-shot generalization across diverse manipulation tasks in both real-world and simulated environments, and it significantly improves the efficiency and generalizability of large-scale synthetic data generation for robotic manipulation.
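The core idea above is a primitive defined in an object's canonical space. As a rough sketch of how such a primitive can induce a metric spatial constraint (illustrative only, not the authors' implementation; `InteractionPrimitive`, `constraint_cost`, and all parameter names are hypothetical), a primitive can pair a keypoint with a direction in the canonical frame, and a constraint becomes a cost over candidate 6D poses:

```python
# Illustrative sketch (not the paper's code): an object-centric interaction
# primitive -- a point plus a direction in the object's canonical frame --
# and a spatial-constraint cost that scores a candidate 6D pose against it.
from dataclasses import dataclass

import numpy as np


@dataclass
class InteractionPrimitive:
    point: np.ndarray      # 3D keypoint in the object's canonical frame
    direction: np.ndarray  # unit vector, e.g. a functional principal axis


def constraint_cost(pose: np.ndarray, prim: InteractionPrimitive,
                    target_point: np.ndarray, target_dir: np.ndarray) -> float:
    """Distance plus alignment error of a primitive under a 4x4 object pose."""
    R, t = pose[:3, :3], pose[:3, 3]
    p_world = R @ prim.point + t           # keypoint mapped into the world frame
    d_world = R @ prim.direction           # axis mapped into the world frame
    dist_err = float(np.linalg.norm(p_world - target_point))
    align_err = 1.0 - float(d_world @ target_dir)  # 0 when perfectly aligned
    return dist_err + align_err
```

Minimizing such a cost over candidate poses is one plausible way to couple VLM-selected primitives to precise low-level execution.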

📝 Abstract
The development of general robotic systems capable of manipulation in unstructured environments is a significant challenge. While Vision-Language Models (VLMs) excel at high-level commonsense reasoning, they lack the fine-grained 3D spatial understanding required for precise manipulation tasks. Fine-tuning VLMs on robotic datasets to create Vision-Language-Action models (VLAs) is a potential solution, but it is hindered by high data-collection costs and generalization issues. To address these challenges, we propose a novel object-centric representation that bridges the gap between a VLM's high-level reasoning and the low-level precision required for manipulation. Our key insight is that an object's canonical space, defined by its functional affordances, provides a structured and semantically meaningful way to describe interaction primitives, such as points and directions. These primitives act as a bridge, translating the VLM's commonsense reasoning into actionable 3D spatial constraints. In this context, we introduce a dual closed-loop, open-vocabulary robotic manipulation system: one loop for high-level planning through primitive resampling, interaction rendering, and VLM checking, and another for low-level execution via 6D pose tracking. This design ensures robust, real-time control without requiring VLM fine-tuning. Extensive experiments demonstrate strong zero-shot generalization across diverse robotic manipulation tasks, highlighting the potential of this approach for automating large-scale simulation data generation.
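To make the dual-loop structure concrete, the skeleton below is a hedged sketch of the control flow the abstract describes, not the paper's API; callables such as `sample_primitives`, `render_interaction`, `vlm_accepts`, `track_pose`, and `step_toward` are hypothetical stand-ins. The outer loop plans by resampling primitives, rendering the implied interaction, and asking the VLM to verify; the inner loop executes while continuously re-reading a tracked 6D pose.

```python
# Hedged skeleton of a dual closed-loop system (illustrative stand-in names).
from typing import Any, Callable, Optional, Sequence


def dual_closed_loop(
    sample_primitives: Callable[[], Sequence[Any]],  # candidate interaction primitives
    render_interaction: Callable[[Any], Any],        # render the interaction a primitive implies
    vlm_accepts: Callable[[Any], bool],              # VLM checks the rendering against the task
    track_pose: Callable[[], Any],                   # current 6D object pose from the tracker
    step_toward: Callable[[Any, Any], bool],         # one control step; True when satisfied
    max_rounds: int = 10,
    max_steps: int = 200,
) -> Optional[Any]:
    """Outer loop: closed-loop planning. Inner loop: closed-loop execution."""
    for _ in range(max_rounds):
        # Planning loop: resample primitives until the VLM accepts one.
        plan = next((p for p in sample_primitives()
                     if vlm_accepts(render_interaction(p))), None)
        if plan is None:
            continue  # no candidate passed the check; resample
        # Execution loop: servo toward the constraint under pose feedback.
        for _ in range(max_steps):
            if step_toward(plan, track_pose()):
                return plan  # spatial constraint satisfied
        # Execution stalled within the step budget; fall through and replan.
    return None
```

The two budgets (`max_rounds`, `max_steps`) are illustrative design choices that keep both loops bounded so a failed VLM check or a stalled execution triggers replanning rather than a hang.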
Problem

Research questions and friction points this paper is trying to address.

3D Spatial Manipulation
Robotics
Adaptive Object Handling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Model
Dual-Layer System
Object-Function Integration
Mingjie Pan
Peking University
Jiyao Zhang
Peking University
Embodied AI · Robotics · 3D Vision
Tianshu Wu
CFCS, School of CS, Peking University
Yinghao Zhao
AgiBot
Wenlong Gao
AgiBot
Hao Dong
CFCS, School of CS, Peking University; PKU-AgiBot Lab