CDE: Concept-Driven Exploration for Reinforcement Learning

📅 2025-10-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In visual reinforcement learning, efficiently exploring task-relevant structure from high-dimensional pixel inputs remains challenging. This paper proposes a concept-driven exploration framework that uses a pre-trained vision-language model (VLM) to turn textual task descriptions into object-centric, weakly supervised concepts, and introduces an autoencoding-based concept reconstruction mechanism. Intrinsic rewards are derived from reconstruction accuracy, guiding the policy toward task-relevant objects. The noisy VLM-derived concepts serve only as learning signals during training, so the policy internalizes them and needs no VLM queries at deployment, which reduces dependence on external models. Evaluated on five challenging simulated visual manipulation tasks, the method achieves targeted, efficient exploration and remains robust to noisy VLM predictions. It also attains an 80% success rate on a real-world manipulation task with a Franka Research 3 arm, demonstrating real-world transfer.

📝 Abstract

Intelligent exploration remains a critical challenge in reinforcement learning (RL), especially in visual control tasks. Unlike low-dimensional state-based RL, visual RL must extract task-relevant structure from raw pixels, making exploration inefficient. We propose Concept-Driven Exploration (CDE), which leverages a pre-trained vision-language model (VLM) to generate object-centric visual concepts from textual task descriptions as weak, potentially noisy supervisory signals. Rather than directly conditioning on these noisy signals, CDE trains a policy to reconstruct the concepts via an auxiliary objective, using reconstruction accuracy as an intrinsic reward to guide exploration toward task-relevant objects. Because the policy internalizes these concepts, VLM queries are only needed during training, reducing dependence on external models during deployment. Across five challenging simulated visual manipulation tasks, CDE achieves efficient, targeted exploration and remains robust to noisy VLM predictions. Finally, we demonstrate real-world transfer by deploying CDE on a Franka Research 3 arm, attaining an 80% success rate in a real-world manipulation task.
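The intrinsic-reward idea in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the paper's implementation: concepts are represented as binary object masks, reconstruction accuracy is measured as intersection-over-union (IoU), and the names `mask_iou`, `shaped_reward`, and the weight `beta` are hypothetical.

```python
from typing import List

# Assumption: a "concept" is a binary object mask (1 = pixel belongs to the object).
Mask = List[List[int]]


def mask_iou(predicted: Mask, target: Mask) -> float:
    """Intersection-over-union between two binary masks of equal shape."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(predicted, target):
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly; this also avoids dividing by zero.
    return 1.0 if union == 0 else intersection / union


def shaped_reward(extrinsic: float, predicted: Mask, vlm_mask: Mask,
                  beta: float = 0.1) -> float:
    """Task reward plus a bonus for reconstructing the (noisy) VLM concept.

    `beta` is a hypothetical weighting coefficient, not a value from the paper.
    """
    return extrinsic + beta * mask_iou(predicted, vlm_mask)


# Toy 3x3 example: the policy's reconstruction misses one object pixel.
vlm_mask = [[0, 1, 1],
            [0, 1, 1],
            [0, 0, 0]]
predicted = [[0, 1, 1],
             [0, 1, 0],
             [0, 0, 0]]

accuracy = mask_iou(predicted, vlm_mask)           # 3 shared pixels / 4 in union
reward = shaped_reward(0.0, predicted, vlm_mask)   # bonus only, no task reward yet
```

In this sketch the bonus grows as the reconstruction overlaps the VLM-provided mask more closely, which is one simple way to reward the agent for keeping task-relevant objects in view.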

Problem

Research questions and friction points this paper is trying to address.

Exploration in visual reinforcement learning is inefficient because task-relevant structure must be extracted from raw pixels

Object-centric concept signals generated by vision-language models are weak and potentially noisy

Visual manipulation policies must remain robust in both simulated and real-world environments

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages a pre-trained vision-language model to generate object-centric concepts from textual task descriptions

Uses concept reconstruction as intrinsic exploration reward

Internalizes concepts during training to reduce dependence on external models at deployment
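The training-only use of the VLM can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's architecture: `StubVLM`, `ConceptAgent`, the scalar "concept" target, and the learning rate are all hypothetical stand-ins chosen to keep the example self-contained.

```python
class StubVLM:
    """Hypothetical stand-in for a pre-trained VLM; counts its queries."""

    def __init__(self) -> None:
        self.calls = 0

    def concept_for(self, instruction: str, observation: object) -> float:
        self.calls += 1
        # Assumption: the concept is a single scalar target, fixed at 1.0 here.
        return 1.0


class ConceptAgent:
    """Toy agent that keeps its own internal estimate of the concept."""

    def __init__(self) -> None:
        self.concept_estimate = 0.0

    def train_step(self, observation: object, instruction: str,
                   vlm: StubVLM, lr: float = 0.5) -> float:
        # Training: query the VLM and move the internal estimate toward it.
        target = vlm.concept_for(instruction, observation)
        error = target - self.concept_estimate
        self.concept_estimate += lr * error
        # Intrinsic reward is higher when the reconstruction error was smaller.
        return -abs(error)

    def act(self, observation: object) -> int:
        # Deployment: only the internalized estimate is used, never the VLM.
        return 1 if self.concept_estimate > 0.5 else 0


vlm = StubVLM()
agent = ConceptAgent()

# Training phase: one VLM query per step.
rewards = [agent.train_step(None, "pick up the red cube", vlm) for _ in range(5)]
calls_after_training = vlm.calls

# Deployment phase: the agent acts without touching the VLM.
actions = [agent.act(None) for _ in range(10)]
calls_after_deployment = vlm.calls
```

The query counter stays unchanged during the deployment loop, which mirrors the claim that the external model is needed only while the policy is being trained.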
