🤖 AI Summary
Existing navigation models are constrained by task specialization and reliance on explicit maps, limiting generalization to unseen real-world environments. This paper introduces the first video-driven unified Vision-Language-Action (VLA) model for embodied navigation, supporting diverse tasks including instruction following, goal-oriented search, visual question answering, and person tracking. Our approach features: (1) standardized multi-task input-output formats with end-to-end single-model learning; (2) a cross-task collaborative training paradigm that eliminates dependence on predefined waypoints or explicit map representations; and (3) a large-scale navigation dataset comprising 3.6 million real-scene video samples. Extensive evaluation demonstrates state-of-the-art performance across multiple benchmarks. Crucially, the model exhibits strong zero-shot generalization to previously unseen physical environments and robust long-horizon, multi-task navigation capabilities—without any environment-specific fine-tuning or map priors.
📝 Abstract
A practical navigation agent must be capable of handling a wide range of interaction demands, such as following instructions, searching for objects, answering questions, tracking people, and more. Existing models for embodied navigation fall short of serving as practical generalists in the real world, as they are often constrained by specific task configurations or pre-defined maps with discretized waypoints. In this work, we present Uni-NaVid, the first video-based vision-language-action (VLA) model designed to unify diverse embodied navigation tasks and enable seamless navigation for mixed long-horizon tasks in unseen real-world environments. Uni-NaVid achieves this by harmonizing the input and output data configurations of all commonly used embodied navigation tasks, thereby integrating all tasks into one model. To train Uni-NaVid, we collect 3.6 million navigation data samples in total from four essential navigation sub-tasks and foster synergy in learning across them. Extensive experiments on comprehensive navigation benchmarks clearly demonstrate the advantages of unified modeling in Uni-NaVid and show that it achieves state-of-the-art performance. Additionally, real-world experiments confirm the model's effectiveness and efficiency, highlighting its strong generalizability.