🤖 AI Summary
General embodied navigation in unknown environments remains hindered by poor generalization, reliance on pre-built maps, and rigid procedural constraints. To address these limitations, we propose ARNA, the first Large Vision-Language Model (LVLM)-native agent architecture explicitly designed for navigation tasks. ARNA coordinates perception, reasoning, and action in a dynamic closed loop through multimodal sensory fusion, runtime-adaptive tool invocation, and iterative reasoning, eliminating dependencies on prior mapping, hand-crafted rules, and myopic exploration. Built on a modular robot-stack interface, it supports iterative action planning and autonomous generation of multi-step workflows. Evaluated on the HM-EQA benchmark in Habitat Lab, ARNA achieves state-of-the-art performance, substantially improving exploration efficiency, navigation accuracy, and embodied question answering in map-free settings. This work establishes a scalable architectural paradigm for general-purpose embodied agents.
📝 Abstract
Developing general-purpose navigation policies for unknown environments remains a core challenge in robotics. Most existing systems rely on task-specific neural networks and fixed data flows, limiting generalizability. Large Vision-Language Models (LVLMs) offer a promising alternative by embedding human-like knowledge suitable for reasoning and planning. Yet, prior LVLM-robot integrations typically depend on pre-mapped spaces, hard-coded representations, and myopic exploration. We introduce the Agentic Robotic Navigation Architecture (ARNA), a general-purpose navigation framework that equips an LVLM-based agent with a library of perception, reasoning, and navigation tools available within modern robotic stacks. At runtime, the agent autonomously defines and executes task-specific workflows that iteratively query the robotic modules, reason over multimodal inputs, and select appropriate navigation actions. This approach enables robust navigation and reasoning in previously unmapped environments, providing a new perspective on robotic stack design. Evaluated in Habitat Lab on the HM-EQA benchmark, ARNA achieves state-of-the-art performance, demonstrating effective exploration, navigation, and embodied question answering without relying on handcrafted plans, fixed input representations, or pre-existing maps.
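The closed loop described above, in which an LVLM agent iteratively queries robotic modules, reasons over the results, and selects the next action, can be sketched in miniature. This is a hypothetical illustration only: the tool names (`look`), the scripted policy, and the `AgenticNavigator` class are assumptions for demonstration, not ARNA's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Any

@dataclass
class Tool:
    """A callable module from the robotic stack (perception, planning, etc.)."""
    name: str
    description: str
    fn: Callable[..., Any]

class AgenticNavigator:
    """Closed-loop agent: query tools, accumulate observations, act or answer."""

    def __init__(self, tools: list[Tool], policy: Callable, max_steps: int = 10):
        self.tools = {t.name: t for t in tools}
        self.policy = policy        # stand-in for the LVLM's decision-making
        self.max_steps = max_steps

    def run(self, task: str):
        history: list[tuple[str, Any]] = [("task", task)]
        for _ in range(self.max_steps):
            # The policy inspects the interaction history and the available
            # tools, then either invokes a tool or emits a final answer.
            decision = self.policy(history, list(self.tools))
            if decision["action"] == "answer":
                return decision["content"]
            tool = self.tools[decision["action"]]
            result = tool.fn(**decision.get("args", {}))
            history.append((tool.name, result))
        return None  # step budget exhausted without an answer

# Deterministic scripted policy standing in for an LVLM, for demonstration:
# first look around, then answer based on what was observed.
def scripted_policy(history, tool_names):
    if len(history) == 1:
        return {"action": "look", "args": {}}
    return {"action": "answer", "content": "kitchen"}

tools = [Tool("look", "capture an egocentric observation", lambda: "rgb frame")]
agent = AgenticNavigator(tools, scripted_policy)
answer = agent.run("Where is the stove?")
```

The key design point this sketch mirrors is that the workflow is not hard-coded: the policy decides at each step which module to query next, so the same loop serves exploration, question answering, or goal navigation depending on the task.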