PointMapPolicy: Structured Point Cloud Processing for Multi-Modal Imitation Learning

📅 2025-10-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods suffer from either insufficient geometric fidelity in point cloud models or lack of explicit geometric awareness in RGB-only models, limiting precision and generalization in robotic manipulation. To address this, we propose PointMapPolicy, an end-to-end multimodal imitation learning framework. First, it maps raw point clouds to structured 3D grids without downsampling—preserving full geometric detail and enabling seamless cross-coordinate-system transformations. Second, it introduces xLSTM as the 3D backbone, enabling efficient joint modeling of geometric structure and RGB semantics. Third, it employs a diffusion-based policy decoder for robust action generation. Evaluated on RoboCasa, CALVIN, and real-robot platforms, PointMapPolicy achieves state-of-the-art performance, significantly improving success rates and robustness on complex manipulation tasks.
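The key data structure described above is the point map: a point cloud kept in its original image-shaped grid rather than downsampled into an unordered set. As an illustration only (not the paper's actual implementation), such a map can be built by back-projecting a depth image through the pinhole camera model; the function name and intrinsics values below are hypothetical:

```python
import numpy as np

def depth_to_point_map(depth, fx, fy, cx, cy):
    """Back-project a depth image (H, W) into a point map (H, W, 3).

    Every pixel keeps its grid position, so no downsampling occurs and
    image-style backbones can consume the result like a 3-channel image.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # per-pixel coordinates
    x = (u - cx) * depth / fx                       # pinhole back-projection
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)         # (H, W, 3)

# Toy example: a 2x2 depth image with made-up intrinsics
depth = np.array([[1.0, 1.0],
                  [2.0, 2.0]])
pm = depth_to_point_map(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
print(pm.shape)  # (2, 2, 3)
```

Because the output keeps the `(H, W)` layout, it can be fed to 2D architectures (here, the paper's xLSTM backbone) alongside the aligned RGB image.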

📝 Abstract
Robotic manipulation systems benefit from complementary sensing modalities, where each provides unique environmental information. Point clouds capture detailed geometric structure, while RGB images provide rich semantic context. Current point cloud methods struggle to capture fine-grained detail, especially for complex tasks, while RGB methods lack geometric awareness, which hinders their precision and generalization. We introduce PointMapPolicy, a novel approach that conditions diffusion policies on structured grids of points without downsampling. The resulting data type makes it easier to extract shape and spatial relationships from observations and can be transformed between reference frames. Moreover, because the points are arranged in a regular grid, established computer vision techniques can be applied directly to 3D data. Using xLSTM as a backbone, our model efficiently fuses the point maps with RGB data for enhanced multi-modal perception. Through extensive experiments on the RoboCasa and CALVIN benchmarks and real-robot evaluations, we demonstrate that our method achieves state-of-the-art performance across diverse manipulation tasks. An overview and demos are available on the project page: https://point-map.github.io/Point-Map/
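The abstract notes that point maps can be transformed between reference frames while keeping their grid structure. A minimal sketch of that idea, assuming a standard 4x4 homogeneous rigid transform (the helper name and example transform are hypothetical, not from the paper):

```python
import numpy as np

def transform_point_map(point_map, T):
    """Apply a 4x4 rigid transform T to every point in an (H, W, 3) map.

    The (H, W) grid layout is untouched, so the transformed map can still
    be processed like an image by 2D backbones.
    """
    h, w, _ = point_map.shape
    pts = point_map.reshape(-1, 3)
    ones = np.ones((pts.shape[0], 1))
    pts_h = np.concatenate([pts, ones], axis=1)  # homogeneous coordinates
    out = (T @ pts_h.T).T[:, :3]                 # rotate + translate
    return out.reshape(h, w, 3)

# Example: shift a map one unit along x (e.g. camera-to-base change of frame)
T = np.eye(4)
T[0, 3] = 1.0
pm = np.zeros((2, 2, 3))
moved = transform_point_map(pm, T)
print(moved[0, 0])  # [1. 0. 0.]
```

This per-pixel transform is what allows the same observation to be expressed in, say, the camera frame or the robot base frame without losing the grid structure the backbone relies on.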
Problem

Research questions and friction points this paper is trying to address.

Addresses limitations in point cloud fine-grained detail capture
Overcomes RGB methods' lack of geometric awareness in robotics
Enhances multi-modal perception for robotic manipulation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Structured point cloud grids without downsampling
Direct computer vision application to 3D data
xLSTM backbone fuses point maps with RGB