🤖 AI Summary
Traditional reinforcement learning struggles to balance efficient exploration with deep semantic understanding in complex, unknown environments, primarily due to the limited cognitive capacity of policy networks and frequent reliance on human intervention. To address this, we propose an embodied semantic exploration framework endowed with high-level cognitive capabilities. Our method introduces a hierarchical reward mechanism to guide multi-stage decision-making, designs a vision-language model (VLM)-driven query action module for dynamic external commonsense retrieval, and incorporates curriculum learning for progressive capability acquisition. The approach unifies deep reinforcement learning, VLM-based commonsense reasoning, and structured reward engineering. Experimental results demonstrate substantial improvements in object discovery rates, autonomous navigation to semantically rich regions, and learned strategic invocation of VLM queries—enabling resource-efficient, commonsense-augmented exploration.
📝 Abstract
Autonomously navigating and understanding complex, unknown environments demands more than basic perception and movement from embodied agents. Truly effective exploration requires agents to possess higher-level cognitive abilities: the ability to reason about their surroundings and to make informed decisions about exploration strategies. However, traditional RL approaches struggle to balance efficient exploration with semantic understanding, owing to the limited cognitive capabilities embedded in agents' small policy networks, which often forces reliance on human intervention for semantic exploration. In this paper, we address this challenge by presenting a novel Deep Reinforcement Learning (DRL) architecture specifically designed for resource-efficient semantic exploration. A key methodological contribution is the integration of Vision-Language Model (VLM) common-sense knowledge through a layered reward function. The VLM query is modeled as a dedicated action, allowing the agent to strategically query the VLM only when external guidance is deemed necessary, thereby conserving resources. This mechanism is combined with a curriculum learning strategy that guides training through progressively harder levels of complexity, ensuring robust and stable learning. Our experimental results convincingly demonstrate that our agent achieves significantly enhanced object discovery rates and develops a learned capability to navigate effectively towards semantically rich regions. Furthermore, it shows strategic mastery of when to prompt for external environmental information. By demonstrating a practical and scalable method for embedding common-sense semantic reasoning in autonomous agents, this research offers a novel path towards fully intelligent, self-guided exploration in robotics.
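The abstract's central mechanism, modeling the VLM query as a dedicated action whose cost is weighed against layered reward terms, can be illustrated with a minimal sketch. The action ids, reward weights, and function signature below are illustrative assumptions for exposition, not the paper's actual implementation:

```python
from dataclasses import dataclass

# Hypothetical sketch: the VLM query is one more discrete action, and a
# layered reward combines exploration, semantic, and query-cost terms so the
# agent learns to invoke the VLM only when the expected guidance pays off.
# All ids and weights below are assumed for illustration.

MOVE_FORWARD, TURN_LEFT, TURN_RIGHT, QUERY_VLM = range(4)

@dataclass
class RewardConfig:
    discovery_bonus: float = 1.0   # reward per newly discovered object
    semantic_bonus: float = 0.2    # bonus for entering a semantically rich region
    step_penalty: float = -0.01    # small per-step cost to encourage efficiency
    query_cost: float = -0.1       # price of invoking the external VLM

def layered_reward(action: int,
                   new_objects: int,
                   in_semantic_region: bool,
                   cfg: RewardConfig = RewardConfig()) -> float:
    """Sum the layered reward terms for one environment step."""
    r = cfg.step_penalty
    r += cfg.discovery_bonus * new_objects
    if in_semantic_region:
        r += cfg.semantic_bonus
    if action == QUERY_VLM:
        r += cfg.query_cost  # the agent must learn when a query is worth it
    return r
```

Under such a scheme, a query that leads to no discovery yields a net negative reward, while one that steers the agent toward new objects or semantic regions is net positive, which is what drives the "strategic mastery of when to prompt" described above.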