PIG-Nav: Key Insights for Pretrained Image Goal Navigation Models

📅 2025-07-23
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limited zero-shot and cross-environment generalization capability of visual navigation models. To this end, we propose PIG-Nav, a framework built upon a pretrained ViT image encoder, featuring an early-fusion architecture that jointly encodes egocentric observations and goal images, and incorporating a multi-task self-supervised learning mechanism to enhance cross-environment representation learning. Furthermore, we introduce a lightweight, automated game-video annotation pipeline to enable large-scale navigation pretraining. Extensive experiments across two simulation platforms and one real-world robot environment demonstrate that PIG-Nav achieves an average 22.6% improvement in zero-shot navigation performance. With only minimal fine-tuning data, it attains state-of-the-art results, outperforming prior methods by 37.5%. The framework significantly improves model transferability and practical applicability for real-world deployment.

πŸ“ Abstract
Recent studies have explored pretrained (foundation) models for vision-based robotic navigation, aiming to achieve generalizable navigation and positive transfer across diverse environments while enhancing zero-shot performance in unseen settings. In this work, we introduce PIG-Nav (Pretrained Image-Goal Navigation), a new approach that further investigates pretraining strategies for vision-based navigation models and contributes in two key areas. Model-wise, we identify two critical design choices that consistently improve the performance of pretrained navigation models: (1) integrating an early-fusion network structure to combine visual observations and goal images via an appropriately pretrained Vision Transformer (ViT) image encoder, and (2) introducing suitable auxiliary tasks to enhance global navigation representation learning, thus further improving navigation performance. Dataset-wise, we propose a novel data preprocessing pipeline for efficiently labeling large-scale game video datasets for navigation model training. We demonstrate that augmenting existing open navigation datasets with diverse gameplay videos improves model performance. Our model achieves an average improvement of 22.6% in zero-shot settings and a 37.5% improvement in fine-tuning settings over existing visual navigation foundation models in two complex simulated environments and one real-world environment. These results advance the state of the art in pretrained image-goal navigation models. Notably, our model maintains competitive performance while requiring significantly less fine-tuning data, highlighting its potential for real-world deployment with minimal labeled supervision.
Problem

Research questions and friction points this paper is trying to address.

Improving pretraining strategies for vision-based navigation models
Enhancing zero-shot performance in unseen navigation environments
Efficiently labeling large-scale datasets for navigation training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Early-fusion ViT encoder combines vision and goal
Auxiliary tasks enhance global navigation learning
Gameplay video data preprocessing improves training
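The early-fusion idea listed above, in which observation and goal images share one token sequence before the transformer, can be illustrated with a minimal stdlib-only sketch. All names here (`patchify`, `early_fusion_tokens`, the patch size) are illustrative assumptions, not from the paper, and the ViT layers, positional embeddings, and pretrained weights are omitted:

```python
from typing import List

Token = List[float]  # a flattened patch vector

def patchify(image: List[List[float]], patch: int) -> List[Token]:
    """Split a 2D image (H x W) into non-overlapping patch tokens."""
    h, w = len(image), len(image[0])
    tokens = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            tokens.append([image[r][c]
                           for r in range(i, i + patch)
                           for c in range(j, j + patch)])
    return tokens

def early_fusion_tokens(obs: List[List[float]],
                        goal: List[List[float]],
                        patch: int = 4) -> List[Token]:
    """Early fusion: observation and goal patches form ONE combined
    sequence, so transformer self-attention can relate the two images
    at every layer. Late fusion, by contrast, would encode each image
    separately and merge only the final embeddings."""
    return patchify(obs, patch) + patchify(goal, patch)

# Toy 8x8 "images": each yields 4 patches of 16 values,
# so the fused sequence has 8 tokens.
obs  = [[0.0] * 8 for _ in range(8)]
goal = [[1.0] * 8 for _ in range(8)]
tokens = early_fusion_tokens(obs, goal)
```

The sketch only shows the token-level fusion step; in the described architecture these fused tokens would be consumed by a pretrained ViT encoder.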
Authors
Jiansong Wan
Chengming Zhou
Jinkua Liu
Xiangge Huang
Xiaoyu Chen
Xiaohan Yi (Microsoft Research)
Qisen Yang
Baiting Zhu
Xin-Qiang Cai (RIKEN Center for Advanced Intelligence Project)
Lixing Liu
Rushuai Yang (Hong Kong University of Science and Technology)
Chuheng Zhang (Microsoft Research)
Sherif Abdelfattah
Hayong Shin
Pushi Zhang (Microsoft Research)
Li Zhao (Microsoft Research)
Jiang Bian (Microsoft Research)