RoboGround: Robotic Manipulation with Grounded Vision-Language Priors

📅 2025-04-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Weak policy generalization and the difficulty of jointly modeling target objects, placement regions, and geometric attributes hinder robotic manipulation. To address this, we propose a vision-language model–based intermediate representation framework leveraging grounded segmentation masks. Our approach is the first to integrate large-scale vision-language grounding priors into closed-loop robotic manipulation, utilizing a CLIP-style model to generate spatially aligned, differentiable masks that jointly guide target localization, shape/size perception, and cross-task semantic alignment. We introduce the first automated simulation data generation pipeline tailored for grounding-guided learning, enabling high-diversity synthesis of instruction-object-scene triples. Integrating behavior cloning with reinforcement learning in Gazebo and Isaac Gym, our method achieves an average 37.2% improvement in generalization across object-, scene-, and zero-shot instruction-transfer tasks—substantially outperforming text-only or image-only baselines.

Technology Category

Application Category

📝 Abstract
Recent advancements in robotic manipulation have highlighted the potential of intermediate representations for improving policy generalization. In this work, we explore grounding masks as an effective intermediate representation, balancing two key advantages: (1) effective spatial guidance that specifies target objects and placement areas while also conveying information about object shape and size, and (2) broad generalization potential driven by large-scale vision-language models pretrained on diverse grounding datasets. We introduce RoboGround, a grounding-aware robotic manipulation system that leverages grounding masks as an intermediate representation to guide policy networks in object manipulation tasks. To further explore and enhance generalization, we propose an automated pipeline for generating large-scale, simulated data with a diverse set of objects and instructions. Extensive experiments show the value of our dataset and the effectiveness of grounding masks as intermediate guidance, significantly enhancing the generalization abilities of robot policies.
Problem

Research questions and friction points this paper is trying to address.

Improving robotic manipulation policy generalization using grounding masks
Leveraging vision-language models for spatial guidance in object manipulation
Generating large-scale simulated data to enhance robotic policy generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses grounding masks as intermediate spatial guidance
Leverages large-scale vision-language pretrained models
Generates simulated data for diverse generalization
Haifeng Huang
Haifeng Huang
Iowa State University
Computer VisionMulti-modal Learning
X
Xinyi Chen
Shanghai AI Laboratory
Y
Yilun Chen
Shanghai AI Laboratory
H
Hao Li
Shanghai AI Laboratory
X
Xiaoshen Han
Shanghai AI Laboratory
Z
Zehan Wang
Zhejiang University
Tai Wang
Tai Wang
Shanghai AI Laboratory
Computer Vision3D VisionEmbodied AIDeep Learning
J
Jiangmiao Pang
Shanghai AI Laboratory
Zhou Zhao
Zhou Zhao
Zhejiang University
Machine LearningData MiningMultimedia Computing