CDE: Concept-Driven Exploration for Reinforcement Learning

📅 2025-10-09
🤖 AI Summary
In visual reinforcement learning, efficiently exploring task-relevant structure from high-dimensional pixel inputs remains challenging. This paper proposes Concept-Driven Exploration (CDE), a framework that uses a pre-trained vision-language model (VLM) to turn textual task descriptions into object-centric visual concepts, which serve as weak and potentially noisy supervision. The policy is trained to reconstruct these concepts through an auxiliary objective, and reconstruction accuracy serves as an intrinsic reward that steers exploration toward task-relevant objects. Because the policy internalizes the concepts during training, no VLM queries are needed at deployment, which reduces dependence on external models. On five simulated visual manipulation tasks, the method achieves targeted, efficient exploration and remains robust to noisy VLM predictions. Deployed on a real Franka Research 3 arm, it attains an 80% success rate on a manipulation task.

📝 Abstract
Intelligent exploration remains a critical challenge in reinforcement learning (RL), especially in visual control tasks. Unlike low-dimensional state-based RL, visual RL must extract task-relevant structure from raw pixels, making exploration inefficient. We propose Concept-Driven Exploration (CDE), which leverages a pre-trained vision-language model (VLM) to generate object-centric visual concepts from textual task descriptions as weak, potentially noisy supervisory signals. Rather than directly conditioning on these noisy signals, CDE trains a policy to reconstruct the concepts via an auxiliary objective, using reconstruction accuracy as an intrinsic reward to guide exploration toward task-relevant objects. Because the policy internalizes these concepts, VLM queries are only needed during training, reducing dependence on external models during deployment. Across five challenging simulated visual manipulation tasks, CDE achieves efficient, targeted exploration and remains robust to noisy VLM predictions. Finally, we demonstrate real-world transfer by deploying CDE on a Franka Research 3 arm, attaining an 80% success rate in a real-world manipulation task.
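
The reward idea in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that a concept is a binary object mask and that "reconstruction accuracy" is measured as intersection-over-union (IoU) between the policy's reconstructed mask and the VLM-provided mask. The function names `concept_iou` and `shaped_reward`, the `threshold`, and the weight `beta` are hypothetical and are not taken from the paper.

```python
import numpy as np

def concept_iou(pred_mask, vlm_mask, threshold=0.5):
    """IoU between a reconstructed mask (probabilities) and a VLM mask (0/1).

    Assumption: concepts are object masks; the paper may use another form.
    """
    pred = np.asarray(pred_mask) >= threshold
    target = np.asarray(vlm_mask).astype(bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Neither mask marks any pixel, so there is nothing to reward.
        return 0.0
    return float(np.logical_and(pred, target).sum() / union)

def shaped_reward(extrinsic, pred_mask, vlm_mask, beta=0.1):
    """Task reward plus a hypothetical intrinsic bonus weighted by beta."""
    return extrinsic + beta * concept_iou(pred_mask, vlm_mask)

# Toy 4x4 example: the VLM marks a 2x2 object, the policy reconstructs 2x3.
vlm_mask = np.zeros((4, 4))
vlm_mask[1:3, 1:3] = 1.0
pred_mask = np.zeros((4, 4))
pred_mask[1:3, 1:4] = 0.9

iou = concept_iou(pred_mask, vlm_mask)
reward = shaped_reward(0.0, pred_mask, vlm_mask, beta=0.1)
```

In this toy case the masks share 4 of 6 marked pixels, so the bonus grows as the reconstruction lines up with the object the VLM highlighted. A continuous measure such as negative reconstruction loss would be an equally plausible choice under the same assumptions.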
Problem

Research questions and friction points this paper is trying to address.

Exploration in visual reinforcement learning is inefficient because task-relevant structure must be extracted from raw pixels
Concept signals generated by vision-language models are weak and potentially noisy
Visual manipulation must remain robust in both simulated and real-world environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages vision-language model for object-centric concepts
Uses concept reconstruction as intrinsic exploration reward
Internalizes concepts to reduce external model dependency
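
The third point, internalizing concepts so that no external model is queried at deployment, can be shown with a toy sketch. It is not the paper's architecture: the "VLM" is a hypothetical stand-in function (`noisy_vlm_label`) that returns a noisy binary label, and the "concept head" is a plain logistic-regression model trained on those labels. All names, the 10% noise rate, and the data are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vlm_calls = 0  # counts queries to the hypothetical external VLM

def noisy_vlm_label(x, flip_prob=0.1):
    """Stand-in for a VLM: says whether the concept is present, with noise."""
    global vlm_calls
    vlm_calls += 1
    true_label = float(x[0] > 0.0)
    return 1.0 - true_label if rng.random() < flip_prob else true_label

def train_concept_head(features, steps=300, lr=0.5):
    """Fit a linear concept head to noisy VLM labels (training time only)."""
    labels = np.array([noisy_vlm_label(x) for x in features])
    w = np.zeros(features.shape[1])
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-features @ w))
        w -= lr * features.T @ (probs - labels) / len(labels)
    return w

def predict_concept(w, x):
    """Deployment-time prediction: uses only the learned head, no VLM."""
    return float(x @ w > 0.0)

# Training: one VLM query per training sample.
train_x = rng.normal(size=(500, 2))
w = train_concept_head(train_x)
calls_after_training = vlm_calls

# Deployment: predictions come from the head alone.
test_x = rng.normal(size=(200, 2))
accuracy = np.mean([predict_concept(w, x) == float(x[0] > 0.0) for x in test_x])
```

The call counter stays fixed after training, and the head still recovers the underlying concept on clean labels despite the noisy supervision, which mirrors the robustness claim in spirit only.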