Learning Spatial-Aware Manipulation Ordering

📅 2025-10-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
In cluttered environments, spatial dependencies among objects often lead to collisions or blocked access; existing methods frequently neglect these relationships, limiting flexibility and generalization. To address this, we propose OrderMind, a framework that unifies spatial context encoding and temporal priority structuring within an end-to-end manipulation ordering planner. Specifically, we construct an object-manipulator interaction graph via k-nearest neighbors and leverage a vision-language model to generate physically and semantically plausible spatial priors for sequence supervision. Our approach jointly models object-object and object-manipulator spatial relationships. Evaluated on a benchmark of roughly 163K samples, OrderMind consistently outperforms state-of-the-art methods in both simulation and real-world settings, achieving significant improvements in real-time inference, collision-free success rate, and kinematic reachability. The framework establishes a scalable, sequential decision-making paradigm for robust manipulation in complex, unstructured environments.

📝 Abstract
Manipulation in cluttered environments is challenging due to spatial dependencies among objects, where an improper manipulation order can cause collisions or blocked access. Existing approaches often overlook these spatial relationships, limiting their flexibility and scalability. To address these limitations, we propose OrderMind, a unified spatial-aware manipulation ordering framework that directly learns object manipulation priorities based on spatial context. Our architecture integrates a spatial context encoder with a temporal priority structuring module. We construct a spatial graph using k-Nearest Neighbors to aggregate geometric information from the local layout and encode both object-object and object-manipulator interactions to support accurate manipulation ordering in real-time. To generate physically and semantically plausible supervision signals, we introduce a spatial prior labeling method that guides a vision-language model to produce reasonable manipulation orders for distillation. We evaluate OrderMind on our Manipulation Ordering Benchmark, comprising 163,222 samples of varying difficulty. Extensive experiments in both simulation and real-world environments demonstrate that our method significantly outperforms prior approaches in effectiveness and efficiency, enabling robust manipulation in cluttered scenes.
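The abstract describes building a spatial graph with k-Nearest Neighbors over the local layout, encoding both object-object and object-manipulator interactions. The paper's actual encoder is a learned network; the sketch below only illustrates the graph-construction step under assumptions not stated in the abstract (objects reduced to 3D centroids, the manipulator added as one extra node connected to every object). The function name `build_spatial_graph` is hypothetical.

```python
import numpy as np

def build_spatial_graph(object_centroids, manipulator_pos, k=4):
    """Hypothetical sketch: k-NN graph over object centroids, with the
    manipulator appended as an extra node linked to every object.
    Nodes 0..n-1 are objects; node n is the manipulator.
    Returns an adjacency list: node index -> list of neighbor indices."""
    n_obj = len(object_centroids)
    adj = {i: [] for i in range(n_obj + 1)}
    for i in range(n_obj):
        # Euclidean distances to all other objects (self excluded).
        d = np.linalg.norm(object_centroids - object_centroids[i], axis=1)
        d[i] = np.inf
        neighbors = np.argsort(d)[:min(k, n_obj - 1)]
        adj[i] = [int(j) for j in neighbors]
        # Object-manipulator edge, so the encoder can aggregate
        # reachability-relevant geometry as well as object layout.
        adj[i].append(n_obj)
        adj[n_obj].append(i)
    return adj
```

A message-passing encoder would then aggregate geometric features along these edges; the choice of k trades off local context against graph density.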
Problem

Research questions and friction points this paper is trying to address.

Learning object manipulation priorities from spatial context
Addressing spatial dependencies to prevent collisions in cluttered environments
Generating physically plausible manipulation orders through vision-language distillation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learns object manipulation priorities using spatial context
Integrates spatial encoder with temporal priority structuring
Uses spatial graph and vision-language model for supervision
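The supervision idea above, distilling VLM-labeled manipulation orders into the model, can be framed as learning per-object priority scores that agree with the teacher's ordering. The abstract does not specify the loss; the following is a minimal sketch assuming a standard logistic pairwise ranking loss, with the function name `pairwise_ranking_loss` and the scalar-score formulation being illustrative assumptions.

```python
import numpy as np

def pairwise_ranking_loss(scores, teacher_order):
    """Illustrative distillation loss (not confirmed by the paper):
    for every pair (a, b) where the teacher order removes a before b,
    penalize the model when scores[a] does not exceed scores[b].
    `scores`: predicted per-object priority scores.
    `teacher_order`: object indices in the teacher's removal order."""
    loss, n_pairs = 0.0, 0
    for i, a in enumerate(teacher_order):
        for b in teacher_order[i + 1:]:
            # Logistic loss on the score margin: log(1 + exp(-(s_a - s_b))).
            loss += np.log1p(np.exp(-(scores[a] - scores[b])))
            n_pairs += 1
    return loss / max(n_pairs, 1)
```

At inference, sorting objects by predicted score yields a manipulation order without querying the VLM, which is what makes real-time use plausible.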
Authors
Yuxiang Yan, Fudan University
Zhiyuan Zhou, PhD student, UC Berkeley (Robotics, Reinforcement Learning)
Xin Gao, Fudan University
Guanghao Li, Fudan University (Graphics)
Shenglin Li, Shanghai YinCheng Intelligent CO., LTD
Jiaqi Chen, Stanford University
Qunyan Pu, Shanghai YinCheng Intelligent CO., LTD
Jian Pu, Institute of Science and Technology for Brain-inspired Intelligence, Fudan University (Autonomous Systems, Computer Vision, Machine Learning)