🤖 AI Summary
This work addresses a limitation of robot policies built on 2D visual representations: they lack the explicit 3D spatial awareness required for high-precision manipulation. Existing 3D fusion approaches are hindered by the irregular structure of sparse point clouds and by the geometric distortion that multi-view orthographic rendering introduces. To overcome these challenges, the authors propose ReMAP-DP, a framework that generates pixel-aligned PointMaps through standardized perspective reprojection and introduces a structure-aware dual-stream diffusion architecture. This architecture fuses frozen semantic features with explicit geometric descriptors via learnable modality embeddings, achieving implicit patch-level alignment. On RoboTwin 2.0, the method achieves an average success rate of 59.3%, outperforming the DP3 baseline by 6.6%; on ManiSkill 3, it improves on DP3 by 28% on the Stack Cube task. In real-world experiments, it executes high-precision, dynamic manipulation from only a handful of demonstrations.
📝 Abstract
Generalist robot policies built upon 2D visual representations excel at semantic reasoning but inherently lack the explicit 3D spatial awareness required for high-precision tasks. Existing 3D integration methods struggle to bridge this gap due to the structural irregularity of sparse point clouds and the geometric distortion introduced by multi-view orthographic rendering. To overcome these barriers, we present ReMAP-DP, a novel framework synergizing standardized perspective reprojection with a structure-aware dual-stream diffusion policy. By coupling the re-projected views with pixel-aligned PointMaps, our dual-stream architecture leverages learnable modality embeddings to fuse frozen semantic features and explicit geometric descriptors, ensuring precise implicit patch-level alignment. Extensive experiments across simulation and real-world environments demonstrate ReMAP-DP's superior performance in diverse manipulation tasks. On RoboTwin 2.0, it attains a 59.3% average success rate, outperforming the DP3 baseline by 6.6%. On ManiSkill 3, our method yields a 28% improvement over DP3 on the geometrically challenging Stack Cube task. Furthermore, ReMAP-DP exhibits remarkable real-world robustness, executing high-precision and dynamic manipulations with superior data efficiency from only a handful of demonstrations. The project page is available at https://icr-lab.github.io/ReMAP-DP/
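To make the idea of a pixel-aligned PointMap concrete, the following is a minimal sketch, not the authors' pipeline. It assumes a generic pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy` and back-projects a depth image so that every pixel stores its own camera-frame XYZ coordinate. The function name `depth_to_pointmap` and the toy depth values are illustrative assumptions. The code is written in plain Python for readability, whereas a practical implementation would likely be vectorized.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def depth_to_pointmap(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[List[Point]]:
    """Back-project a depth image into a pixel-aligned map of XYZ points.

    Assumes a standard pinhole model (an assumption, not taken from the paper):
    pixel (u, v) with depth z maps to ((u - cx) * z / fx, (v - cy) * z / fy, z).
    The output has the same height and width as the input, so entry [v][u]
    holds the 3D point seen by pixel (u, v).
    """
    pointmap: List[List[Point]] = []
    for v, row in enumerate(depth):
        out_row: List[Point] = []
        for u, z in enumerate(row):
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            out_row.append((x, y, z))
        pointmap.append(out_row)
    return pointmap


# Toy 2x3 depth image with hypothetical intrinsics.
depth = [
    [1.0, 1.0, 1.0],
    [2.0, 2.0, 2.0],
]
pm = depth_to_pointmap(depth, fx=2.0, fy=2.0, cx=1.0, cy=0.5)
```

Because the result keeps the image grid, a patch of pixels and the matching patch of 3D points cover the same region. This property is what allows image features and geometric features to be aligned at the patch level without an explicit correspondence search.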