🤖 AI Summary
This paper addresses the inefficiency and poor generalization of exploration strategies for embodied agents navigating toward target objects in unknown environments. It proposes a unified exploration framework grounded in semantic map understanding. Methodologically, the authors introduce a three-stage training paradigm: (1) constructing a structured semantic map that fuses visual, linguistic, and spatial context; (2) adding a return-value prediction module that models dense short-horizon rewards, strengthening the supervision signal; and (3) jointly optimizing the action policy end-to-end with imitation learning and reinforcement learning. The key contribution is explicitly encoding historical observations and spatial relationships into a learnable, reasoning-capable map representation, which enables robust learning from heterogeneous demonstrations, including partial successes and failures. Evaluated on the HM3D and Gibson benchmarks, the approach achieves significant improvements in exploration efficiency and cross-scene generalization.
📝 Abstract
In this paper, we present MUVLA, a Map Understanding Vision-Language-Action model tailored for object navigation. It leverages semantic map abstractions to unify and structure historical information, encoding spatial context in a compact, consistent form. MUVLA takes the current and historical observations, together with the semantic map, as inputs and predicts an action sequence conditioned on a description of the goal object. It further amplifies supervision through reward-guided return modeling based on dense short-horizon progress signals, enabling the model to develop a fine-grained understanding of action values for reward maximization. MUVLA employs a three-stage training pipeline: learning map-level spatial understanding, imitating behaviors from mixed-quality demonstrations, and reward amplification. This strategy allows MUVLA to unify diverse demonstrations into a robust spatial representation and to generate more rational exploration strategies. Experiments on the HM3D and Gibson benchmarks demonstrate that MUVLA achieves strong generalization and learns effective exploration behaviors even from low-quality or partially successful trajectories.
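The reward-guided return modeling described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes per-step progress rewards (e.g., reduction in distance to the goal), and the horizon `k` and discount `gamma` are hypothetical choices not specified by MUVLA.

```python
def short_horizon_returns(rewards, k=5, gamma=0.9):
    """For each step t, compute a discounted sum of the next k rewards.

    These per-step return targets act as dense supervision for a
    return-value prediction head. `k` and `gamma` are illustrative
    assumptions, not values reported by the paper.
    """
    returns = []
    for t in range(len(rewards)):
        window = rewards[t:t + k]  # short horizon: at most k future steps
        returns.append(sum(gamma ** i * r for i, r in enumerate(window)))
    return returns

# Hypothetical example: progress reward = per-step reduction in
# distance to the goal object along one trajectory.
rewards = [0.2, 0.0, 0.5, 0.1, 0.0, 0.3]
targets = short_horizon_returns(rewards, k=3, gamma=0.9)
```

Because the targets depend only on a short window of future rewards, even partially successful or failed trajectories yield useful supervision wherever the agent made local progress, which matches the paper's claim of learning from mixed-quality demonstrations.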