BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models

📅 2025-06-09
🤖 AI Summary
Existing Vision-Language-Action (VLA) methods largely neglect 3D spatial structure, resulting in low sample efficiency and poor generalization. To address this, we propose a novel "input-output 2D spatial alignment" paradigm: 3D observations are projected into multi-view 2D images, while action outputs are uniformly represented as 2D heatmaps, unifying the input and output spaces in 2D. We introduce a scalable heatmap pre-training strategy that equips vision-language models (VLMs) with the ability to predict spatial heatmaps before downstream policy learning. Our method adapts VLM backbones and is evaluated across simulation benchmarks (RLBench, COLOSSEUM, GemBench) and real-robot platforms. It achieves average success rates of 88.2% on RLBench and 64.0% on COLOSSEUM, along with state-of-the-art performance on GemBench; outperforms a strong baseline by 32% on average on real hardware; and attains a 96.8% success rate with only three demonstration trajectories per task, demonstrating substantial gains in sample efficiency, cross-task generalization, and out-of-distribution (OOD) robustness.

📝 Abstract
Recently, leveraging pre-trained vision-language models (VLMs) for building vision-language-action (VLA) models has emerged as a promising approach to effective robot manipulation learning. However, only a few methods incorporate 3D signals into VLMs for action prediction, and they do not fully leverage the spatial structure inherent in 3D data, leading to low sample efficiency. In this paper, we introduce BridgeVLA, a novel 3D VLA model that (1) projects 3D inputs to multiple 2D images, ensuring input alignment with the VLM backbone, and (2) utilizes 2D heatmaps for action prediction, unifying the input and output spaces within a consistent 2D image space. In addition, we propose a scalable pre-training method that equips the VLM backbone with the capability to predict 2D heatmaps before downstream policy learning. Extensive experiments show the proposed method is able to learn 3D manipulation efficiently and effectively. BridgeVLA outperforms state-of-the-art baseline methods across three simulation benchmarks. In RLBench, it improves the average success rate from 81.4% to 88.2%. In COLOSSEUM, it demonstrates significantly better performance in challenging generalization settings, boosting the average success rate from 56.7% to 64.0%. In GemBench, it surpasses all competing baseline methods in terms of average success rate. In real-robot experiments, BridgeVLA outperforms a state-of-the-art baseline method by 32% on average. It generalizes robustly in multiple out-of-distribution settings, including visual disturbances and unseen instructions. Remarkably, it is able to achieve a success rate of 96.8% on 10+ tasks with only 3 trajectories per task, highlighting its extraordinary sample efficiency. Project Website: https://bridgevla.github.io/
Problem

Research questions and friction points this paper is trying to address.

Improving 3D manipulation learning efficiency with VLMs
Aligning 3D inputs and 2D outputs for spatial consistency
Enhancing sample efficiency in vision-language-action models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Projects 3D inputs to multiple 2D images
Uses 2D heatmaps for action prediction
Scalable pre-training for 2D heatmap prediction
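The alignment idea in the bullets above can be sketched roughly as follows: project a point cloud into axis-aligned 2D views, then decode a 3D translation target by fusing the argmax peaks of per-view heatmaps. This is an illustrative assumption-laden sketch, not the paper's implementation: the function names, the orthographic projection, the fixed resolution, and the argmax decoding are all assumptions, and rotation/gripper prediction is not modeled.

```python
import numpy as np

def project_orthographic(points, axis, bounds, res=224):
    """Project Nx3 points into pixel indices of the 2D view that drops `axis`.

    Hypothetical helper: normalizes points to a workspace bounding box, then
    scales the two remaining coordinates to a res x res pixel grid.
    """
    keep = [i for i in range(3) if i != axis]
    lo, hi = bounds
    uv = (points[:, keep] - lo[keep]) / (hi[keep] - lo[keep])  # normalize to [0, 1]
    return np.clip((uv * (res - 1)).round().astype(int), 0, res - 1)

def heatmap_to_3d(heatmaps, bounds, res=224):
    """Fuse argmax peaks of three orthographic view heatmaps into one 3D point.

    Each coordinate appears in exactly two of the three views, so the two
    per-view estimates are averaged.
    """
    lo, hi = bounds
    coords, counts = np.zeros(3), np.zeros(3)
    for axis, hm in enumerate(heatmaps):  # view `axis` drops coordinate `axis`
        keep = [i for i in range(3) if i != axis]
        ij = np.unravel_index(np.argmax(hm), hm.shape)
        for k, pix in zip(keep, ij):
            coords[k] += lo[k] + pix / (res - 1) * (hi[k] - lo[k])
            counts[k] += 1
    return coords / counts
```

In the actual model the heatmaps would be predicted by the VLM backbone from the projected images and the language instruction; here they are only decoded, to show why a 2D heatmap output space can recover a 3D action.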
👥 Authors
Peiyan Li (Ludwig-Maximilians-Universität München)
Yixiang Chen (CASIA, UCAS)
Hongtao Wu (ByteDance Seed)
Xiao Ma (ByteDance Seed)
Xiangnan Wu (CASIA)
Yan Huang (CASIA, UCAS, FiveAges)
Liang Wang (CASIA, UCAS)
Tao Kong (ByteDance Research)
Tieniu Tan (Institute of Automation, Chinese Academy of Sciences)