🤖 AI Summary
To address insufficient perception-action coupling in dual-arm robotic 3D manipulation, this work introduces pixel-level attention maps from a pretrained self-supervised vision transformer (DINOv2) as semantic priors. These priors are projected with depth guidance in a geometrically consistent manner and fused into a 3D voxel representation, enhancing scene understanding in behavior cloning policies. This is the first approach to explicitly translate ViT attention mechanisms into voxel-level semantic cues. Specifically, attention maps are upsampled and lifted into the voxel grid via depth-aware projection, then jointly encoded with multi-view RGB-D inputs. Evaluated on the RLBench dual-arm manipulation benchmark, our method achieves an average absolute success-rate improvement of 8.2% (a 21.9% relative gain) over state-of-the-art methods, demonstrating significantly enhanced generalization and robustness, particularly on complex interactive tasks.
📝 Abstract
We investigate the integration of attention maps from a pre-trained Vision Transformer into voxel representations to enhance bimanual robotic manipulation. Specifically, we extract attention maps from DINOv2, a self-supervised ViT model, and interpret them as pixel-level saliency scores over RGB images. These maps are lifted into a 3D voxel grid, resulting in voxel-level semantic cues that are incorporated into a behavior cloning policy. When integrated into a state-of-the-art voxel-based policy, our attention-guided featurization yields an average absolute improvement of 8.2% and a relative gain of 21.9% across all tasks in the RLBench bimanual benchmark.
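The core operation described above, lifting pixel-level attention scores into a 3D voxel grid via depth-aware projection, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, max-pooling fusion rule, and camera-frame voxel layout are all assumptions, and the real pipeline additionally fuses multiple camera views and encodes RGB-D features alongside the attention channel.

```python
import numpy as np

def lift_attention_to_voxels(attn, depth, K, voxel_origin, voxel_size, grid_shape):
    """Unproject per-pixel attention into a voxel grid using depth.

    Hypothetical helper: attn and depth are (H, W) arrays for one view,
    K is the 3x3 camera intrinsics matrix, and the grid is axis-aligned
    in the camera frame starting at voxel_origin.
    """
    H, W = depth.shape
    us, vs = np.meshgrid(np.arange(W), np.arange(H))
    z = depth.ravel()
    valid = z > 0  # skip pixels with no depth reading
    u, v, z, a = us.ravel()[valid], vs.ravel()[valid], z[valid], attn.ravel()[valid]
    # Back-project pixels to camera-frame 3D points via the pinhole model
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts = np.stack([x, y, z], axis=1)
    # Map 3D points to voxel indices and drop points outside the grid
    idx = np.floor((pts - voxel_origin) / voxel_size).astype(int)
    in_bounds = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    idx, a = idx[in_bounds], a[in_bounds]
    # Fuse by max-pooling attention per voxel (one plausible choice)
    grid = np.zeros(grid_shape, dtype=np.float32)
    np.maximum.at(grid, (idx[:, 0], idx[:, 1], idx[:, 2]), a)
    return grid
```

The resulting attention grid can then be concatenated as an extra channel to the voxelized RGB-D features before they enter the policy's 3D encoder.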