GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data

📅 2025-05-06
📈 Citations: 0
Influential Citations: 0
🤖 AI Summary
High acquisition costs and dense annotation requirements of real-world robotic manipulation data severely hinder the zero-shot generalization and scalability of vision-language-action (VLA) models. Method: We propose GraspVLA, the first VLA foundation model pretrained exclusively on large-scale synthetic action data, using SynGrasp-1B, a billion-scale synthetic grasping dataset. Our approach integrates autoregressive perception and flow-matching-based action generation into a unified chain-of-thought process, coupled with domain-randomized photorealistic rendering and joint training on Internet semantics data to enable efficient sim-to-real transfer. Contribution/Results: The model achieves state-of-the-art performance on both real-world and simulation benchmarks, demonstrating zero-shot, open-vocabulary grasping generalization and few-shot adaptation to specific human preferences. The SynGrasp-1B dataset and pretrained model weights will be released to benefit the community.
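The summary mentions flow-matching-based action generation. As a rough illustration only (not the paper's actual architecture), the sketch below shows a conditional flow-matching training objective for action chunks in PyTorch; the `ActionExpert` module, its dimensions, and the straight-line (rectified-flow style) interpolation are assumptions made for exposition.

```python
import torch
import torch.nn as nn

class ActionExpert(nn.Module):
    """Hypothetical flow-matching action head: predicts the velocity that moves a
    noisy action chunk toward the ground-truth chunk, conditioned on fused
    vision-language features. Dimensions are illustrative, not from the paper."""

    def __init__(self, action_dim=7, horizon=16, cond_dim=512, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim * horizon + cond_dim + 1, hidden),
            nn.GELU(),
            nn.Linear(hidden, action_dim * horizon),
        )

    def forward(self, noisy_actions, t, cond):
        # noisy_actions: (B, horizon * action_dim), t: (B, 1), cond: (B, cond_dim)
        return self.net(torch.cat([noisy_actions, t, cond], dim=-1))


def flow_matching_loss(model, actions, cond):
    """One conditional flow-matching training step: interpolate between Gaussian
    noise and the target action chunk, then regress the constant velocity."""
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1, device=actions.device)  # time in [0, 1]
    x_t = (1.0 - t) * noise + t * actions                       # point on the straight path
    target_velocity = actions - noise
    pred_velocity = model(x_t, t, cond)
    return ((pred_velocity - target_velocity) ** 2).mean()
```

At inference time, an action chunk would be produced by integrating the learned velocity field starting from Gaussian noise, e.g. with a few Euler steps.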

📝 Abstract
Embodied foundation models are gaining increasing attention for their zero-shot generalization, scalability, and adaptability to new tasks through few-shot post-training. However, existing models rely heavily on real-world data, which is costly and labor-intensive to collect. Synthetic data offers a cost-effective alternative, yet its potential remains largely underexplored. To bridge this gap, we explore the feasibility of training Vision-Language-Action models entirely with large-scale synthetic action data. We curate SynGrasp-1B, a billion-frame robotic grasping dataset generated in simulation with photorealistic rendering and extensive domain randomization. Building on this, we present GraspVLA, a VLA model pretrained on large-scale synthetic action data as a foundational model for grasping tasks. GraspVLA integrates autoregressive perception tasks and flow-matching-based action generation into a unified Chain-of-Thought process, enabling joint training on synthetic action data and Internet semantics data. This design helps mitigate sim-to-real gaps and facilitates the transfer of learned actions to a broader range of Internet-covered objects, achieving open-vocabulary generalization in grasping. Extensive evaluations across real-world and simulation benchmarks demonstrate GraspVLA's advanced zero-shot generalizability and few-shot adaptability to specific human preferences. We will release the SynGrasp-1B dataset and pre-trained weights to benefit the community.
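The abstract describes unifying autoregressive perception and flow-matching action generation in a Chain-of-Thought process so that synthetic action data and Internet semantics data can be trained jointly. The schematic step below is a minimal sketch of how such joint training could be organized; `vlm`, `action_expert`, the batch keys, and the unweighted loss sum are illustrative assumptions rather than the paper's actual interfaces, and `flow_matching_loss` refers to the sketch above.

```python
import torch.nn.functional as F

def joint_training_step(vlm, action_expert, batch):
    """Hypothetical joint CoT training step (illustrative, not the paper's code).

    Every sample supervises the autoregressive perception "thought" (e.g.
    grounding the target object named in the instruction); only synthetic
    robot data additionally supervises flow-matching action generation.
    """
    # Autoregressive perception: predict intermediate tokens such as the target
    # object's bounding box / grasp proposal from images + instruction.
    # Assumed shapes: perception_logits (B, T, vocab), perception_tokens (B, T).
    perception_logits, cot_features = vlm(batch["images"], batch["instruction"])
    loss = F.cross_entropy(
        perception_logits.flatten(0, 1), batch["perception_tokens"].flatten()
    )

    if batch["has_actions"]:
        # Synthetic grasping data (SynGrasp-1B-style) also carries action labels.
        # The action head conditions on the shared CoT features, so grasping
        # skills learned in simulation can transfer to objects that appear only
        # in Internet grounding data.
        loss = loss + flow_matching_loss(action_expert, batch["actions"], cot_features)
    return loss
```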
Problem

Research questions and friction points this paper is trying to address.

Exploring synthetic data for Vision-Language-Action model training
Bridging sim-to-real gap in robotic grasping tasks
Achieving open-vocabulary generalization with synthetic action data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pre-trained on billion-scale synthetic grasping data
Unified Chain-of-Thought for perception and action
Mitigates the sim-to-real gap via domain randomization and joint training with Internet semantics data (see the sketch after this list)
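As referenced in the bullet above, part of the sim-to-real story is extensive domain randomization during synthetic data generation. The snippet below is only a hedged sketch of what per-episode randomization parameters might look like; the field names and ranges are invented for illustration and are not taken from the SynGrasp-1B pipeline.

```python
import random

def sample_domain_randomization():
    """Illustrative per-episode domain-randomization parameters for synthetic
    grasping data; fields and ranges are assumptions, not the actual pipeline."""
    return {
        "lighting_intensity_lux": random.uniform(200.0, 1200.0),
        "light_color_temp_k": random.uniform(3000.0, 7500.0),
        "table_texture_id": random.randrange(5000),               # random PBR material
        "camera_jitter_m": [random.uniform(-0.03, 0.03) for _ in range(3)],
        "object_yaw_noise_deg": random.uniform(0.0, 180.0),
        "distractor_count": random.randint(0, 6),
        "background_id": random.randrange(2000),
    }
```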
Shengliang Deng
Galbot, The University of Hong Kong
Mi Yan
Galbot, Peking University
Songlin Wei
University of Southern California, (Previously) Peking University
Robotics, 3D Vision
Haixin Ma
Galbot
Yuxin Yang
Galbot
Jiayi Chen
Galbot, Peking University
Zhiqi Zhang
Galbot, Peking University
Taoyu Yang
Peking University
Xuheng Zhang
Peking University
Heming Cui
University of Hong Kong
Operating Systems, Programming Languages, Distributed Systems, Security
Zhizheng Zhang
Galbot, Beijing Academy of Artificial Intelligence
He Wang
Galbot, Peking University, Beijing Academy of Artificial Intelligence