PRISM-DP: Spatial Pose-based Observations for Diffusion-Policies via Segmentation, Mesh Generation, and Pose Tracking

📅 2025-04-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
In open-world settings, 6D object pose estimation faces key bottlenecks, including unlabeled objects and reliance on pre-scanned CAD meshes. To address these, the authors propose an end-to-end pose-aware framework that treats task-relevant 6D poses as structured observations, replacing high-dimensional, redundant raw image inputs. The method unifies instance segmentation, mesh generation, learned 6D pose estimation, and temporal pose tracking into a closed-loop spatial observation pipeline. It requires no manual modeling or prior object scanning, enabling automatic segmentation and real-time pose tracking for arbitrary novel objects. Experiments in simulation and on real robotic platforms show that the approach significantly outperforms RGB-based diffusion policies, achieves performance on par with state-supervised baselines, and reduces model parameters by over 60%, delivering superior efficiency, generalization, and practical applicability.
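The closed-loop pipeline above can be sketched as follows. All stage functions here are stand-in stubs with hypothetical names (the paper does not publish this API); the point is the control flow: segment and reconstruct a mesh once, estimate an initial pose, then switch to cheaper frame-to-frame tracking.

```python
def segment_object(frame):
    return "mask"            # stub: instance segmentation model

def generate_mesh(frame, mask):
    return "mesh"            # stub: image-to-mesh generation model

def estimate_pose(frame, mesh):
    return (0.0, 0.0, 0.0)   # stub: learned 6D pose estimator

def track_pose(frame, mesh, prev_pose):
    return prev_pose         # stub: temporal 6D pose tracker

def observe(frames):
    """Yield one pose per frame: build a mesh and estimate a pose on the
    first frame, then track the pose on all subsequent frames."""
    mesh, pose = None, None
    for frame in frames:
        if mesh is None:
            mask = segment_object(frame)
            mesh = generate_mesh(frame, mask)
            pose = estimate_pose(frame, mesh)
        else:
            pose = track_pose(frame, mesh, pose)
        yield pose

poses = list(observe(["frame0", "frame1", "frame2"]))
```

Because mesh generation and initial pose estimation run only once per object, the per-frame cost of the loop is dominated by the tracker.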

📝 Abstract
Diffusion-based visuomotor policies generate robot motions by learning to denoise action-space trajectories conditioned on observations. These observations are commonly streams of RGB images, whose high dimensionality includes substantial task-irrelevant information, requiring large models to extract relevant patterns. In contrast, using more structured observations, such as the spatial poses (positions and orientations) of key objects over time, enables training more compact policies that can recognize relevant patterns with fewer parameters. However, obtaining accurate object poses in open-set, real-world environments remains challenging. For instance, it is impractical to assume that all relevant objects are equipped with markers, and recent learning-based 6D pose estimation and tracking methods often depend on pre-scanned object meshes, requiring manual reconstruction. In this work, we propose PRISM-DP, an approach that leverages segmentation, mesh generation, pose estimation, and pose tracking models to enable compact diffusion policy learning directly from the spatial poses of task-relevant objects. Crucially, because PRISM-DP uses a mesh generation model, it eliminates the need for manual mesh processing or creation, improving scalability and usability in open-set, real-world environments. Experiments across a range of tasks in both simulation and real-world settings show that PRISM-DP outperforms high-dimensional image-based diffusion policies and achieves performance comparable to policies trained with ground-truth state information. We conclude with a discussion of the broader implications and limitations of our approach.
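The abstract's core argument is dimensionality: a policy conditioned on a few object poses sees orders of magnitude fewer input dimensions than one conditioned on raw images. A minimal sketch of such a pose-based observation vector (a hypothetical helper, not the paper's API):

```python
def pose_observation(poses):
    """Flatten per-object 6D poses (position + quaternion) into one
    compact observation vector."""
    obs = []
    for position, quaternion in poses:
        obs.extend(position)     # (x, y, z)
        obs.extend(quaternion)   # (qx, qy, qz, qw)
    return obs

# Two tracked objects -> a 14-dim observation, versus ~150,000 dims
# for a single flattened 224x224 RGB frame.
poses = [
    ([0.40, 0.10, 0.20], [0.0, 0.0, 0.0, 1.0]),
    ([0.55, -0.20, 0.12], [0.0, 0.0, 0.707, 0.707]),
]
obs = pose_observation(poses)
print(len(obs))  # 14
```

With inputs this small, the denoising network conditioning on them can be correspondingly compact, which is the source of the parameter savings the paper reports.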
Problem

Research questions and friction points this paper is trying to address.

High-dimensional RGB observations carry substantial task-irrelevant information, requiring large policy models
Accurate 6D object poses are hard to obtain in open-set, real-world environments without markers
Learning-based pose estimation and tracking methods typically depend on pre-scanned object meshes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unifies segmentation, mesh generation, pose estimation, and pose tracking into one observation pipeline
Eliminates manual mesh processing or creation via a mesh generation model
Trains compact diffusion policies directly on spatial poses of task-relevant objects