FSUNav: A Cerebrum-Cerebellum Architecture for Fast, Safe, and Universal Zero-Shot Goal-Oriented Navigation

📅 2026-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language navigation methods face limitations in heterogeneous robot compatibility, real-time performance, safety, and open-vocabulary generalization. This work proposes a brain-inspired “cerebrum–cerebellum” architecture that integrates vision-language models with deep reinforcement learning. The cerebrum module performs open-vocabulary semantic understanding and task planning through a three-layer reasoning process, while the cerebellum module executes high-frequency, end-to-end local obstacle avoidance and control, supporting multimodal inputs and zero-shot navigation. Notably, the framework operates without requiring predefined target identifiers. It achieves state-of-the-art performance on the MP3D, HM3D, and OVON benchmarks, significantly improving both success rates and safety, and demonstrates strong generalization and robustness across multiple real-world robotic platforms.
📝 Abstract
Current vision-language navigation methods face substantial bottlenecks in heterogeneous robot compatibility, real-time performance, and navigation safety, and they struggle to support open-vocabulary semantic generalization and multimodal task inputs. To address these challenges, this paper proposes FSUNav, a Cerebrum-Cerebellum architecture for fast, safe, and universal zero-shot goal-oriented navigation that integrates vision-language models (VLMs) with deep reinforcement learning. The cerebellum module is a high-frequency, end-to-end universal local planner trained with deep reinforcement learning; it enables unified navigation across heterogeneous platforms (e.g., humanoid, quadruped, and wheeled robots), improving navigation efficiency while significantly reducing collision risk. The cerebrum module constructs a three-layer reasoning model and leverages VLMs in an end-to-end detection-and-verification mechanism, enabling zero-shot open-vocabulary goal navigation without predefined target IDs and improving task success rates in both simulation and real-world environments. The framework additionally supports multimodal inputs (e.g., text, target descriptions, and images), further enhancing generalization, real-time performance, safety, and robustness. Experimental results on the MP3D, HM3D, and OVON benchmarks show that FSUNav achieves state-of-the-art performance on object, instance-image, and task navigation, significantly outperforming existing methods. Real-world deployments on diverse robotic platforms further validate its robustness and practical applicability.
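The abstract describes a two-rate control split: a low-frequency cerebrum (VLM-based reasoning that picks an open-vocabulary subgoal) and a high-frequency cerebellum (a learned local planner that tracks it). The paper does not publish code here, so the following is only a minimal Python sketch of that loop under stated assumptions: `cerebrum_plan` and `cerebellum_act` are hypothetical stand-ins (a fixed subgoal and a proportional controller, not the paper's VLM reasoning or DRL policy), used solely to show the replan-rate/control-rate separation.

```python
import math
from dataclasses import dataclass

@dataclass
class Subgoal:
    x: float
    y: float
    label: str  # open-vocabulary target description


def cerebrum_plan(goal_text: str, observation=None) -> Subgoal:
    # Hypothetical stand-in for the paper's three-layer VLM reasoning:
    # here it simply emits a fixed subgoal toward the named target.
    return Subgoal(x=2.0, y=1.0, label=goal_text)


def cerebellum_act(subgoal: Subgoal, pose: list) -> tuple:
    # Hypothetical stand-in for the DRL local planner: a simple
    # proportional controller steering toward the current subgoal.
    dx, dy = subgoal.x - pose[0], subgoal.y - pose[1]
    dist = math.hypot(dx, dy)
    if dist < 0.1:
        return (0.0, 0.0)  # close enough: stop
    return (dx / dist * 0.5, dy / dist * 0.5)  # speed-capped velocity command


def navigate(goal_text: str, steps: int = 100, replan_every: int = 10) -> list:
    pose = [0.0, 0.0]
    subgoal = cerebrum_plan(goal_text)
    for t in range(steps):
        if t % replan_every == 0:          # cerebrum: low-frequency replanning
            subgoal = cerebrum_plan(goal_text)
        vx, vy = cerebellum_act(subgoal, pose)  # cerebellum: every control tick
        pose[0] += vx * 0.1                # integrate with dt = 0.1 s
        pose[1] += vy * 0.1
    return pose
```

In the real system the cerebrum's replanning step would query a VLM over camera observations and the cerebellum would map raw sensor input to platform-agnostic motion commands; the two-rate loop structure is the only part carried over from the abstract.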
Problem

Research questions and friction points this paper is trying to address.

vision-language navigation
heterogeneous robot compatibility
real-time performance
navigation safety
open-vocabulary generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cerebrum-Cerebellum Architecture
Zero-Shot Navigation
Vision-Language Models
Universal Local Planner
Multimodal Input