Large Pre-Trained Models for Bimanual Manipulation in 3D

📅 2025-09-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address insufficient perception-action coupling in dual-arm robotic 3D manipulation, this work introduces pixel-level attention maps from a pretrained self-supervised vision transformer (DINOv2) as semantic priors, which are geometrically consistent depth-guided projected and fused into a 3D voxel representation space to enhance scene understanding in behavior cloning policies. This is the first approach to explicitly translate ViT attention mechanisms into voxel-level semantic cues. Specifically, attention maps are upsampled into the voxel grid via depth-aware projection and jointly encoded with multi-view RGB-D inputs. Evaluated on the RLBench dual-arm manipulation benchmark, our method achieves an average absolute success rate improvement of 8.2% (+21.9% relative gain) over state-of-the-art methods, demonstrating significantly enhanced generalization and robustness—particularly on complex interactive tasks.

Technology Category

Application Category

📝 Abstract
We investigate the integration of attention maps from a pre-trained Vision Transformer into voxel representations to enhance bimanual robotic manipulation. Specifically, we extract attention maps from DINOv2, a self-supervised ViT model, and interpret them as pixel-level saliency scores over RGB images. These maps are lifted into a 3D voxel grid, resulting in voxel-level semantic cues that are incorporated into a behavior cloning policy. When integrated into a state-of-the-art voxel-based policy, our attention-guided featurization yields an average absolute improvement of 8.2% and a relative gain of 21.9% across all tasks in the RLBench bimanual benchmark.
Problem

Research questions and friction points this paper is trying to address.

Enhancing bimanual robotic manipulation using attention maps
Integrating Vision Transformer attention into 3D voxel representations
Improving behavior cloning policies with semantic 3D cues
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates Vision Transformer attention maps
Lifts 2D saliency into 3D voxel grid
Enhances behavior cloning policy performance
🔎 Similar Papers
No similar papers found.