PRISM: Pointcloud Reintegrated Inference via Segmentation and Cross-attention for Manipulation

📅 2025-07-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Robotic imitation learning in cluttered environments suffers from insufficient 3D perception robustness: existing methods are constrained by fixed-view inputs or support only keyframe prediction, limiting their effectiveness in dynamic, high-contact tasks. This paper proposes PRISM, an end-to-end framework that jointly learns from raw point clouds and robot state without pretraining or external data. Its core contributions are threefold: (1) a segmentation-embedding-guided point cloud re-integration mechanism that enhances geometric reasoning; (2) cross-modal attention that fuses local point-wise geometry with proprioceptive state; and (3) a diffusion-based policy that generates smooth, temporally coherent actions. Evaluated under a low-data regime—only 100 demonstrations per task—PRISM significantly outperforms state-of-the-art 2D and 3D baselines, achieving superior accuracy and robustness in highly cluttered, multi-object scenes.

📝 Abstract
Robust imitation learning for robot manipulation requires comprehensive 3D perception, yet many existing methods struggle in cluttered environments. Fixed-camera-view approaches are vulnerable to perspective changes, and 3D point cloud techniques often limit themselves to keyframe predictions, reducing their efficacy in dynamic, contact-intensive tasks. To address these challenges, we propose PRISM, an end-to-end framework that learns directly from raw point cloud observations and robot states, eliminating the need for pretrained models or external datasets. PRISM comprises three main components: a segmentation embedding unit that partitions the raw point cloud into distinct object clusters and encodes local geometric details; a cross-attention component that merges these visual features with processed robot joint states to highlight relevant targets; and a diffusion module that translates the fused representation into smooth robot actions. Trained on 100 demonstrations per task, PRISM surpasses both 2D and 3D baseline policies in accuracy and efficiency within our simulated environments, demonstrating strong robustness in complex, object-dense scenarios. Code and demos are available at https://github.com/czknuaa/PRISM.
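The cross-attention component described above uses the robot's proprioceptive state to attend over per-point geometric features, weighting the points most relevant to the current target. Below is a minimal NumPy sketch of that fusion pattern; it is not the authors' implementation, and all function names, projection matrices, and dimensions are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(state_feat, point_feats, Wq, Wk, Wv):
    """Fuse a proprioceptive embedding with per-point features.

    state_feat:  (d,)   robot-state embedding, used as the query
    point_feats: (N, d) per-point geometric features, used as keys/values
    Wq, Wk:      (d, d_k) query/key projections
    Wv:          (d, d_v) value projection
    Returns a (d_v,) fused feature emphasizing task-relevant points.
    """
    q = state_feat @ Wq                               # (d_k,)
    k = point_feats @ Wk                              # (N, d_k)
    v = point_feats @ Wv                              # (N, d_v)
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))    # (N,) weights over points
    return attn @ v                                   # weighted sum of values
```

In the full framework this attention output would feed the downstream policy head; here it simply illustrates how one query (the robot state) selects among many point tokens.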
Problem

Research questions and friction points this paper is trying to address.

Robust imitation learning in cluttered 3D environments
Overcoming fixed camera view limitations in manipulation tasks
Enhancing point cloud techniques for dynamic contact-intensive tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Segmentation embedding for object clusters
Cross-attention merges visual and robot states
Diffusion module generates smooth robot actions
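The diffusion module generates an action sequence by iteratively denoising Gaussian noise conditioned on the fused features. The NumPy sketch below shows a generic DDPM-style reverse pass for intuition only; the noise predictor, schedule, and dimensions are placeholder assumptions, not PRISM's actual model.

```python
import numpy as np

def denoise_actions(noise_pred_fn, horizon, act_dim, T=50, seed=0):
    """Generic DDPM-style reverse process for an action trajectory.

    noise_pred_fn(a, t) stands in for a learned network that predicts
    the noise in trajectory `a` at diffusion step `t` (conditioned, in
    the real system, on the fused perception/state features).
    """
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, T)      # linear variance schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    a = rng.normal(size=(horizon, act_dim))  # start from pure noise
    for t in reversed(range(T)):
        eps = noise_pred_fn(a, t)            # predicted noise component
        # remove the predicted noise (DDPM posterior mean)
        a = (a - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            a += np.sqrt(betas[t]) * rng.normal(size=a.shape)
    return a                                 # (horizon, act_dim) action sequence
```

Because every denoising step refines the whole trajectory jointly, the resulting actions tend to be smooth and temporally coherent, which is the property the paper highlights for contact-rich manipulation.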
Daqi Huang — National University of Singapore
Zhehao Cai — National University of Singapore
Yuzhi Hao — National University of Singapore
Zechen Li — University of New South Wales, Sydney
Chee-Meng Chew — National University of Singapore