🤖 AI Summary
This work addresses zero-shot object-goal navigation—a challenging setting where agents must locate target objects specified by natural language without task-specific training, environmental interaction, or annotated data. Methodologically, we propose the first general-purpose navigation framework that synergistically integrates vision foundation models (VFMs) with a model-based frontier-point exploration planner, enabling long-horizon semantic reasoning and cross-scene generalization. We evaluate our approach on the Habitat simulation platform and the photorealistic HM3D scene dataset under strict zero-shot conditions. Results demonstrate state-of-the-art performance in Success weighted by Path Length (SPL), significantly outperforming prior methods. Crucially, our framework eliminates reliance on large-scale expert demonstrations, environment-specific priors, and dense supervision—marking a departure from conventional learning-based navigation paradigms. By unifying foundational visual understanding with interpretable, model-driven planning, our approach establishes a scalable, generalizable paradigm for open-world embodied intelligence.
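The frontier-point exploration the summary refers to is, in its classic form, the detection of free map cells that border unexplored space. The paper's actual map representation and planner are not detailed here, so the following is a minimal sketch under the common assumption of a 2D occupancy grid with cells marked free (0), occupied (1), or unknown (-1):

```python
import numpy as np

def find_frontiers(grid):
    """Return (row, col) indices of free cells adjacent to unknown space.

    Cell convention (an illustrative assumption, not the paper's spec):
    0 = free, 1 = occupied, -1 = unknown.
    """
    rows, cols = grid.shape
    frontiers = []
    for r in range(rows):
        for c in range(cols):
            if grid[r, c] != 0:  # only free cells can be frontiers
                continue
            # check 4-connected neighbours for unknown space
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr, nc] == -1:
                    frontiers.append((r, c))
                    break
    return frontiers

# Tiny example map: the right edge is still unexplored.
grid = np.array([
    [ 0,  0, -1],
    [ 0,  1, -1],
    [ 0,  0,  0],
])
print(find_frontiers(grid))  # → [(0, 1), (2, 2)]
```

A semantic planner of the kind described would then score these frontier points (e.g. using VFM-derived cues about where the target object is likely to be) and navigate toward the most promising one.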
📝 Abstract
Object goal navigation is a fundamental task in embodied AI, in which an agent is instructed to locate a target object in an unexplored environment. Traditional learning-based methods rely heavily on large-scale annotated data or require extensive interaction with the environment in a reinforcement learning setting; they often fail to generalize to novel environments and scale poorly. To overcome these challenges, we explore a zero-shot setting in which the agent operates without task-specific training, enabling a more scalable and adaptable solution. Recent advances in Vision Foundation Models (VFMs) offer powerful capabilities for visual understanding and reasoning, making them well suited to helping agents comprehend scenes, identify relevant regions, and infer the likely locations of target objects. In this work, we present a zero-shot object goal navigation framework that integrates the perceptual strength of VFMs with a model-based planner capable of long-horizon decision making through frontier exploration. We evaluate our approach on the HM3D dataset using the Habitat simulator and demonstrate that our method achieves state-of-the-art performance in success weighted by path length (SPL) for zero-shot object goal navigation.
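For readers unfamiliar with the reported metric: success weighted by path length (SPL) credits an episode only if the agent succeeds, and discounts the credit by how much longer the agent's path was than the shortest feasible path (the standard definition from Anderson et al., "On Evaluation of Embodied Navigation Agents"). A minimal sketch, with illustrative variable names:

```python
def spl(episodes):
    """Success weighted by Path Length over a list of episodes.

    episodes: list of (success, shortest_path_len, agent_path_len) tuples,
    where lengths are geodesic distances in the same units.
    """
    total = 0.0
    for success, l_shortest, p_agent in episodes:
        if success:
            # efficiency term: 1.0 for an optimal path, <1.0 otherwise
            total += l_shortest / max(p_agent, l_shortest)
    return total / len(episodes)

# Example: one optimal success, one success with a 2x-longer path, one failure.
print(spl([(True, 5.0, 5.0), (True, 5.0, 10.0), (False, 5.0, 7.0)]))
# → 0.5  (= (1.0 + 0.5 + 0.0) / 3)
```

The `max(p_agent, l_shortest)` in the denominator caps the per-episode score at 1.0, so an agent cannot be rewarded for a path shorter than the geodesic optimum (which would indicate a measurement error).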