Versatile and Generalizable Manipulation via Goal-Conditioned Reinforcement Learning with Grounded Object Detection

📅 2025-07-14
🤖 AI Summary
This work addresses the limited generalization of general-purpose robotic manipulation with a text-guided, mask-based goal-conditioned reinforcement learning framework. A pre-trained object detection model takes a text prompt, identifies the target object in the scene, and produces a mask that serves as the goal-conditioning signal for the policy. Because the mask is an object-agnostic cue, it improves feature sharing and generalization across objects. In a simulated reach-and-grasp task, the method maintains a ~90% success rate on both in-distribution and out-of-distribution objects and converges faster to higher returns. The core contribution is integrating a pre-trained, text-prompted object detector into goal-conditioned RL through mask-based goal conditioning.

📝 Abstract
General-purpose robotic manipulation, including reach and grasp, is essential for deployment into households and workspaces involving diverse and evolving tasks. Recent advances propose using large pre-trained models, such as Large Language Models and object detectors, to boost robotic perception in reinforcement learning. These models, trained on large datasets via self-supervised learning, can process text prompts and identify diverse objects in scenes, an invaluable skill in RL where learning object interaction is resource-intensive. This study demonstrates how to integrate such models into Goal-Conditioned Reinforcement Learning to enable general and versatile robotic reach and grasp capabilities. We use a pre-trained object detection model to enable the agent to identify the object from a text prompt and generate a mask for goal conditioning. Mask-based goal conditioning provides object-agnostic cues, improving feature sharing and generalization. The effectiveness of the proposed framework is demonstrated in a simulated reach-and-grasp task, where the mask-based goal conditioning consistently maintains a $\sim$90% success rate in grasping both in- and out-of-distribution objects, while also ensuring faster convergence to higher returns.
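The mask-based goal conditioning described in the abstract can be illustrated with a minimal sketch. It assumes, hypothetically, that the detector returns a pixel-space bounding box for the prompted object and that the mask is appended to the camera image as an extra channel; the abstract does not specify either detail, and the names `box_to_mask` and `goal_conditioned_observation` are illustrative only, not the authors' code.

```python
import numpy as np


def box_to_mask(box, height, width):
    """Turn a pixel-space box (x_min, y_min, x_max, y_max) into a binary mask.

    Assumption: the detector output is a bounding box; the paper only says
    that a mask is generated for the prompted object.
    """
    x_min, y_min, x_max, y_max = box
    # Clip to the image bounds so partially visible objects still give a valid mask.
    x_min, x_max = np.clip([x_min, x_max], 0, width)
    y_min, y_max = np.clip([y_min, y_max], 0, height)
    mask = np.zeros((height, width), dtype=np.float32)
    mask[int(y_min):int(y_max), int(x_min):int(x_max)] = 1.0
    return mask


def goal_conditioned_observation(image, mask):
    """Append the goal mask to an (H, W, C) image as one extra channel.

    Assumption: channel concatenation is one plausible way to condition the
    policy on the mask; the paper may use a different mechanism.
    """
    return np.concatenate([image.astype(np.float32), mask[..., None]], axis=-1)


# Placeholder image and a made-up detection box for illustration.
image = np.zeros((64, 64, 3), dtype=np.uint8)
mask = box_to_mask((10, 20, 30, 40), height=64, width=64)
obs = goal_conditioned_observation(image, mask)
```

The mask carries only where the target is, not what it looks like, which is the sense in which the cue is object-agnostic: the same policy input format applies to any object the detector can localize from a prompt.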
Problem

Research questions and friction points this paper is trying to address.

Enabling general-purpose robotic manipulation via goal-conditioned reinforcement learning
Integrating pre-trained object detection models to improve robotic perception
Achieving high success rates in grasping diverse objects using mask-based goal conditioning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Goal-Conditioned Reinforcement Learning with object detection
Pre-trained object detection for text-prompted object identification
Mask-based goal conditioning for improved generalization
Authors

Huiyi Wang
Digital Technologies, National Research Council of Canada, Ottawa, Canada
Colin Bellinger
University of Ottawa
Machine Learning, Reinforcement Learning, Robotics, Active Learning, Limited and Imbalanced Data
Fahim Shahriar
University of Alberta, Amii, Edmonton, Canada
Alireza Azimi
University of Alberta
reinforcement learning, machine learning, math
Gautham Vasan
University of Alberta, Amii
Artificial Intelligence, Reinforcement Learning, Robotics, Machine Learning
Rupam Mahmood
University of Alberta, Amii, Edmonton, Canada