CLAMP: Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining

📅 2026-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing behavior cloning methods pretrained on 2D images struggle to model 3D spatial information, limiting the accuracy and generalization of robotic manipulation. This work proposes a 3D multi-view action-conditioned pretraining framework that fuses RGB-D images with camera extrinsics to construct point clouds, which are then re-rendered into four-channel multi-view images encoding depth and 3D coordinates. The approach introduces action-conditioned contrastive learning to align object geometry and spatial position with robot actions, and pretrains a diffusion policy alongside the encoders to improve sample efficiency and performance on downstream tasks. By jointly leveraging 3D point clouds, dynamic multi-view observations including a wrist-mounted view, and action-conditioned contrastive learning, the method enables end-to-end pretraining of 3D-aware policies that significantly outperform existing approaches across six simulated and five real-world tasks.
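The point-cloud construction and re-rendering step described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the pinhole back-projection, the z-buffer point splatting, the 4-channel (x, y, z, depth) layout, and the variables `depths`, `intrinsics`, `extrinsics`, and `virtual_views` are all assumptions made for illustration.

```python
import numpy as np

def backproject(depth, K, T_cam2world):
    """Lift a depth map (H, W) to world-frame points (N, 3).

    K: 3x3 pinhole intrinsics; T_cam2world: 4x4 camera extrinsics.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.ravel()
    x = (u.ravel() - K[0, 2]) * z / K[0, 0]
    y = (v.ravel() - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)  # homogeneous coords
    pts_world = (T_cam2world @ pts_cam.T).T[:, :3]
    return pts_world[z > 0]                                  # drop invalid depth

def render_xyzd(points, K, T_world2cam, hw=(128, 128)):
    """Point-splat the merged cloud into a 4-channel (x, y, z, depth) image."""
    h, w = hw
    pts_cam = (T_world2cam @ np.c_[points, np.ones(len(points))].T).T[:, :3]
    z = pts_cam[:, 2]
    keep = z > 1e-6
    pts_cam, pts_world, z = pts_cam[keep], points[keep], z[keep]
    u = np.round(K[0, 0] * pts_cam[:, 0] / z + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * pts_cam[:, 1] / z + K[1, 2]).astype(int)
    inb = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z, pts_world = u[inb], v[inb], z[inb], pts_world[inb]
    # z-buffer trick: write far-to-near so the nearest point per pixel wins
    order = np.argsort(-z)
    img = np.zeros((h, w, 4), dtype=np.float32)
    img[v[order], u[order], :3] = pts_world[order]  # world xyz channels
    img[v[order], u[order], 3] = z[order]           # depth channel
    return img

# Merge clouds from all calibrated RGB-D cameras, then re-render from virtual
# viewpoints (e.g. a wrist-mounted pose) as 4-channel observations.
cloud = np.concatenate([backproject(d, K_i, T_i)
                        for d, K_i, T_i in zip(depths, intrinsics, extrinsics)])
views = [render_xyzd(cloud, K_v, np.linalg.inv(T_v)) for K_v, T_v in virtual_views]
```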

📝 Abstract
Leveraging pre-trained 2D image representations in behavior cloning policies has achieved great success and has become a standard approach for robotic manipulation. However, such representations fail to capture the 3D spatial information about objects and scenes that is essential for precise manipulation. In this work, we introduce Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining (CLAMP), a novel 3D pre-training framework that utilizes point clouds and robot actions. From the merged point cloud computed from RGB-D images and camera extrinsics, we re-render multi-view four-channel image observations with depth and 3D coordinates, including dynamic wrist views, to provide clearer views of target objects for high-precision manipulation tasks. The pre-trained encoders learn to associate the 3D geometric and positional information of objects with robot action patterns via contrastive learning on large-scale simulated robot trajectories. During encoder pre-training, we also pre-train a Diffusion Policy to initialize the policy weights, which is essential for improving sample efficiency and performance during fine-tuning. After pre-training, we fine-tune the policy on a limited number of task demonstrations using the learned image and action representations. We demonstrate that this pre-training and fine-tuning design substantially improves learning efficiency and policy performance on unseen tasks. Furthermore, we show that CLAMP outperforms state-of-the-art baselines across six simulated tasks and five real-world tasks.
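As a concrete reading of the action-conditioned contrastive objective, below is a minimal PyTorch sketch using a symmetric, CLIP-style InfoNCE loss between observation and action embeddings. The paper's exact loss may differ; the encoder outputs `obs_emb`/`act_emb`, the pairing scheme, and the temperature value are assumptions here.

```python
import torch
import torch.nn.functional as F

def action_conditioned_infonce(obs_emb, act_emb, temperature=0.07):
    """Symmetric InfoNCE aligning observation and action embeddings.

    obs_emb: (B, D) pooled features from the 4-channel multi-view encoder.
    act_emb: (B, D) features from an action encoder over the action chunk.
    Matching (observation, action) pairs from the same trajectory step are
    positives; all other pairs in the batch serve as negatives.
    """
    obs_emb = F.normalize(obs_emb, dim=-1)
    act_emb = F.normalize(act_emb, dim=-1)
    logits = obs_emb @ act_emb.t() / temperature     # (B, B) similarity matrix
    targets = torch.arange(len(obs_emb), device=logits.device)
    loss_o2a = F.cross_entropy(logits, targets)      # observation -> action
    loss_a2o = F.cross_entropy(logits.t(), targets)  # action -> observation
    return 0.5 * (loss_o2a + loss_a2o)
```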
Problem

Research questions and friction points this paper is trying to address.

3D spatial information
robotic manipulation
pre-trained representations
behavior cloning
multi-view perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

contrastive learning
3D point clouds
multi-view rendering
action-conditioned pretraining
diffusion policy (see the training-step sketch below)
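For the Diffusion Policy pretraining mentioned in the abstract, a single training step in the usual noise-prediction (DDPM) formulation might look like the sketch below. `noise_pred_net` is a hypothetical conditional denoiser and `scheduler` is assumed to follow the Hugging Face `diffusers` DDPMScheduler interface; the paper's actual architecture and noise schedule are not specified here.

```python
import torch
import torch.nn.functional as F

def diffusion_policy_loss(noise_pred_net, obs_cond, actions, scheduler):
    """One DDPM training step for the action head: predict the noise added to
    a ground-truth action chunk, conditioned on observation features.

    noise_pred_net(noisy_actions, t, obs_cond) -> predicted noise (assumed
    signature); scheduler: a diffusers-style DDPMScheduler (assumption).
    """
    noise = torch.randn_like(actions)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (actions.shape[0],), device=actions.device)
    noisy_actions = scheduler.add_noise(actions, noise, t)  # forward process
    pred = noise_pred_net(noisy_actions, t, obs_cond)
    return F.mse_loss(pred, noise)
```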