Adapt3R: Adaptive 3D Scene Representation for Domain Transfer in Imitation Learning

📅 2025-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Imitation learning (IL) policies generalize poorly across robot embodiments and to novel camera poses. To address this, the authors propose Adapt3R, a general-purpose 3D observation encoder with a "semantics-first, 3D-localization" design: a pretrained 2D backbone extracts semantic features from RGB-D inputs, while 3D geometry serves only as a medium for localizing those features relative to the end-effector, avoiding reliance on explicit 3D scene reconstruction. The encoder fuses observations from one or more calibrated RGB-D cameras into a single vector that can condition arbitrary IL policies. Trained end-to-end with several state-of-the-art multi-task IL algorithms, Adapt3R preserves their multi-task learning capacity while enabling zero-shot transfer to novel embodiments and camera extrinsics, and the paper includes a comprehensive suite of ablation and sensitivity studies.

📝 Abstract
Imitation Learning (IL) has been very effective in training robots to perform complex and diverse manipulation tasks. However, its performance declines precipitously when the observations are out of the training distribution. 3D scene representations that incorporate observations from calibrated RGBD cameras have been proposed as a way to improve generalizability of IL policies, but our evaluations in cross-embodiment and novel camera pose settings found that they show only modest improvement. To address those challenges, we propose Adaptive 3D Scene Representation (Adapt3R), a general-purpose 3D observation encoder which uses a novel architecture to synthesize data from one or more RGBD cameras into a single vector that can then be used as conditioning for arbitrary IL algorithms. The key idea is to use a pretrained 2D backbone to extract semantic information about the scene, using 3D only as a medium for localizing this semantic information with respect to the end-effector. We show that when trained end-to-end with several SOTA multi-task IL algorithms, Adapt3R maintains these algorithms' multi-task learning capacity while enabling zero-shot transfer to novel embodiments and camera poses. Furthermore, we provide a detailed suite of ablation and sensitivity experiments to elucidate the design space for point cloud observation encoders.
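The core idea in the abstract, extracting semantics in 2D and using 3D only to localize those features relative to the end-effector before pooling to a single conditioning vector, can be illustrated with a minimal sketch. This is not the paper's implementation: the real Adapt3R uses a pretrained ViT backbone and learned attention pooling, whereas here `feats` stands in for backbone features and a mean pool stands in for the learned pooling; all function names are illustrative.

```python
import numpy as np

def backproject(depth, K, T_cam_to_base):
    """Lift a depth map to 3D points in the robot base frame.

    depth: (H, W) depth in meters; K: (3, 3) pinhole intrinsics;
    T_cam_to_base: (4, 4) camera-to-base extrinsics.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    pts_cam = np.stack([x, y, depth, np.ones_like(depth)], axis=-1)
    pts_base = pts_cam.reshape(-1, 4) @ T_cam_to_base.T
    return pts_base[:, :3]  # (H*W, 3)

def encode_view(feats, depth, K, T_cam_to_base, eef_pos):
    """Semantics-first, 3D-localization sketch for one camera view.

    feats: (H, W, D) per-pixel semantic features (stand-in for a
    pretrained 2D backbone's output). 3D is used only to express
    each feature's position relative to the end-effector; a mean
    pool stands in for the paper's learned pooling into one vector.
    """
    pts = backproject(depth, K, T_cam_to_base)          # (N, 3)
    rel = pts - eef_pos                                  # end-effector-relative
    tokens = np.concatenate([feats.reshape(len(rel), -1), rel], axis=-1)
    return tokens.mean(axis=0)                           # (D + 3,)
```

Multi-view fusion would concatenate or jointly pool the per-view tokens before conditioning the downstream IL policy; the key property is that camera extrinsics enter only through the end-effector-relative coordinates, not through a reconstructed scene.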
Problem

Research questions and friction points this paper is trying to address.

IL performance declines sharply when observations fall outside the training distribution.
Existing 3D scene representations from calibrated RGBD cameras yield only modest gains in cross-embodiment and novel-camera-pose settings.
Synthesizing observations from multiple RGBD cameras into a single policy-ready representation remains difficult.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapt3R, an adaptive 3D scene representation for domain transfer, usable as conditioning for arbitrary IL algorithms
Pretrained 2D backbone extracts semantics; 3D is used only to localize them relative to the end-effector
Zero-shot transfer to novel embodiments and camera poses
Albert Wilcox
Georgia Institute of Technology, Georgia Tech Research Institute
Mohamed Ghanem
Georgia Institute of Technology
Masoud Moghani
University of Toronto
Pierre Barroso
Georgia Tech Research Institute
Benjamin Joffe
Georgia Institute of Technology, Georgia Tech Research Institute
Animesh Garg
Georgia Institute of Technology, University of Toronto
Robotic Manipulation, Robot Learning, Reinforcement Learning, Machine Learning, Computer Vision