🤖 AI Summary
To address insufficient perception-action coupling in dual-arm robotic 3D manipulation, this work introduces pixel-level attention maps from a pretrained self-supervised vision transformer (DINOv2) as semantic priors. These priors are projected with depth guidance in a geometrically consistent manner and fused into a 3D voxel representation, enhancing scene understanding in behavior cloning policies. This is the first approach to explicitly translate ViT attention mechanisms into voxel-level semantic cues. Specifically, attention maps are upsampled and lifted into the voxel grid via depth-aware projection, then jointly encoded with multi-view RGB-D inputs. Evaluated on the RLBench dual-arm manipulation benchmark, our method achieves an average absolute success-rate improvement of 8.2% (a 21.9% relative gain) over state-of-the-art methods, demonstrating significantly enhanced generalization and robustness, particularly on complex interactive tasks.
📝 Abstract
We investigate the integration of attention maps from a pre-trained Vision Transformer into voxel representations to enhance bimanual robotic manipulation. Specifically, we extract attention maps from DINOv2, a self-supervised ViT model, and interpret them as pixel-level saliency scores over RGB images. These maps are lifted into a 3D voxel grid, resulting in voxel-level semantic cues that are incorporated into a behavior cloning policy. When integrated into a state-of-the-art voxel-based policy, our attention-guided featurization yields an average absolute improvement of 8.2% and a relative gain of 21.9% across all tasks in the RLBench bimanual benchmark.
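The core operation described above, lifting pixel-level attention scores into a 3D voxel grid via depth-aware projection, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, max-pooling fusion rule, and camera-frame voxel layout are all assumptions, and the real pipeline additionally fuses multiple camera views and encodes RGB-D features alongside the attention channel.

```python
import numpy as np

def lift_attention_to_voxels(attn, depth, K, voxel_origin, voxel_size, grid_shape):
    """Unproject per-pixel attention into a voxel grid using depth.

    Hypothetical helper: attn and depth are (H, W) arrays for one view,
    K is the 3x3 camera intrinsics matrix, and the grid is axis-aligned
    in the camera frame starting at voxel_origin.
    """
    H, W = depth.shape
    us, vs = np.meshgrid(np.arange(W), np.arange(H))
    z = depth.ravel()
    valid = z > 0  # skip pixels with no depth reading
    u, v, z, a = us.ravel()[valid], vs.ravel()[valid], z[valid], attn.ravel()[valid]
    # Back-project pixels to camera-frame 3D points via the pinhole model
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts = np.stack([x, y, z], axis=1)
    # Map 3D points to voxel indices and drop points outside the grid
    idx = np.floor((pts - voxel_origin) / voxel_size).astype(int)
    in_bounds = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    idx, a = idx[in_bounds], a[in_bounds]
    # Fuse by max-pooling attention per voxel (one plausible choice)
    grid = np.zeros(grid_shape, dtype=np.float32)
    np.maximum.at(grid, (idx[:, 0], idx[:, 1], idx[:, 2]), a)
    return grid
```

The resulting attention grid can then be concatenated as an extra channel to the voxelized RGB-D features before they enter the policy's 3D encoder.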