RoboGround: Robotic Manipulation with Grounded Vision-Language Priors

📅 2025-04-30

📈 Citations: 0

✨ Influential: 0

career value

217K/year

🤖 AI Summary

Weak policy generalization and the difficulty of jointly modeling target objects, placement regions, and geometric attributes hinder robotic manipulation. To address this, we propose a vision-language model–based intermediate representation framework leveraging grounded segmentation masks. Our approach is the first to integrate large-scale vision-language grounding priors into closed-loop robotic manipulation, utilizing a CLIP-style model to generate spatially aligned, differentiable masks that jointly guide target localization, shape/size perception, and cross-task semantic alignment. We introduce the first automated simulation data generation pipeline tailored for grounding-guided learning, enabling high-diversity synthesis of instruction-object-scene triples. Integrating behavior cloning with reinforcement learning in Gazebo and Isaac Gym, our method achieves an average 37.2% improvement in generalization across object-, scene-, and zero-shot instruction-transfer tasks—substantially outperforming text-only or image-only baselines.

Technology Category

Application Category

📝 Abstract

Recent advancements in robotic manipulation have highlighted the potential of intermediate representations for improving policy generalization. In this work, we explore grounding masks as an effective intermediate representation, balancing two key advantages: (1) effective spatial guidance that specifies target objects and placement areas while also conveying information about object shape and size, and (2) broad generalization potential driven by large-scale vision-language models pretrained on diverse grounding datasets. We introduce RoboGround, a grounding-aware robotic manipulation system that leverages grounding masks as an intermediate representation to guide policy networks in object manipulation tasks. To further explore and enhance generalization, we propose an automated pipeline for generating large-scale, simulated data with a diverse set of objects and instructions. Extensive experiments show the value of our dataset and the effectiveness of grounding masks as intermediate guidance, significantly enhancing the generalization abilities of robot policies.

Problem

Research questions and friction points this paper is trying to address.

Improving robotic manipulation policy generalization using grounding masks

Leveraging vision-language models for spatial guidance in object manipulation

Generating large-scale simulated data to enhance robotic policy generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses grounding masks as intermediate spatial guidance

Leverages large-scale vision-language pretrained models

Generates simulated data for diverse generalization

🔎 Similar Papers

HiFi-CS: Towards Open Vocabulary Visual Grounding For Robotic Grasping Using Vision-Language Models

2024-09-16arXiv.orgCitations: 3

Bridging Language and Action: A Survey of Language-Conditioned Robot Manipulation

2023-12-17Citations: 10

What Foundation Models can Bring for Robot Learning in Manipulation : A Survey

2024-04-28arXiv.orgCitations: 15