AI Summary
In language-guided embodied navigation, agents often fail to localize targets accurately during long-horizon navigation in unseen environments, owing to redundant historical perception and insufficient multimodal understanding. To address this, we propose the first unified framework that integrates perception, planning, and prediction modules in a synergistic manner. Our key contributions are: (1) an adaptive 3D-aware history sampling strategy that dynamically compresses redundant historical memory; and (2) a joint architecture combining multi-task learning (navigation + embodied question answering), large language model-driven semantic understanding, and explicit 3D scene modeling. Evaluated on the $\mathrm{CHORES}$-$\mathbb{S}$ benchmark, our method achieves a 75% success rate in object goal navigation, significantly outperforming existing state-of-the-art approaches.
Abstract
In language-guided visual navigation, agents locate target objects in unseen environments using natural language instructions. For reliable navigation in unfamiliar scenes, agents must possess strong perception, planning, and prediction capabilities. Additionally, when agents revisit previously explored areas during long-term navigation, they may retain irrelevant and redundant historical perceptions, leading to suboptimal results. In this work, we introduce \textbf{P3Nav}, a unified framework that integrates \textbf{P}erception, \textbf{P}lanning, and \textbf{P}rediction capabilities through \textbf{Multitask Collaboration} on navigation and embodied question answering (EQA) tasks, thereby enhancing navigation performance. Furthermore, P3Nav employs an \textbf{Adaptive 3D-aware History Sampling} strategy to utilize historical observations both effectively and efficiently. By leveraging a large language model (LLM), P3Nav comprehends diverse commands and complex visual scenes, producing appropriate navigation actions. P3Nav achieves a 75% success rate in object goal navigation on the $\mathrm{CHORES}$-$\mathbb{S}$ benchmark, setting a new state of the art.
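The core idea behind adaptive history sampling, dropping past observations that are redundant with ones already retained, can be illustrated with a minimal greedy similarity filter. This is only an illustrative sketch under assumed details, not P3Nav's actual algorithm: the function name `sample_history`, the cosine-similarity criterion, and the threshold value are all assumptions for the example.

```python
import numpy as np

def sample_history(features, sim_threshold=0.95):
    """Greedily retain a history step only if its embedding is
    sufficiently dissimilar (cosine) from every step kept so far.

    features: (T, D) array of per-step observation embeddings.
    Returns the indices of the retained history steps.
    """
    kept = []
    for t in range(len(features)):
        f = features[t] / (np.linalg.norm(features[t]) + 1e-8)
        redundant = any(
            np.dot(f, features[k] / (np.linalg.norm(features[k]) + 1e-8))
            >= sim_threshold
            for k in kept
        )
        if not redundant:
            kept.append(t)
    return kept

# Example: three near-duplicate views (e.g., a revisited area)
# followed by one genuinely new view.
feats = np.array([[1.0, 0.0],
                  [0.99, 0.01],
                  [1.0, 0.001],
                  [0.0, 1.0]])
print(sample_history(feats))  # → [0, 3]: the near-duplicates are dropped
```

A greedy filter like this keeps memory growth bounded when an agent loiters in or revisits an explored area, at the cost of a tunable similarity threshold.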