🤖 AI Summary
Robot navigation faces the dual challenges of environmental generalization and embodiment-specific adaptation. To address these, the authors propose VAMOS, a hierarchical vision-language-action (VLA) model that decouples high-level semantic planning from low-level embodied perception. Specifically, an open-world vision-language model serves as a general-purpose planner, generating natural-language-interpretable navigation goals. A lightweight, embodiment-specific affordance model operates directly in image space: the planner proposes candidate paths, and the affordance model re-ranks them by physical feasibility, explicitly encoding the dynamic constraints of diverse platforms (e.g., quadrupedal and wheeled robots). This architecture enables cross-platform deployment and language-guided navigation without retraining the planner. Evaluated in real-world indoor and outdoor environments, the method outperforms state-of-the-art model-based and end-to-end approaches, and ablations show that rejecting physically infeasible plans yields a threefold improvement in single-robot success rate, while maintaining high reliability and end-to-end interpretability.
📝 Abstract
A fundamental challenge in robot navigation lies in learning policies that generalize across diverse environments while conforming to the unique physical constraints and capabilities of a specific embodiment (e.g., quadrupeds can walk up stairs, but rovers cannot). We propose VAMOS, a hierarchical VLA that decouples semantic planning from embodiment grounding: a generalist planner learns from diverse, open-world data, while a specialist affordance model learns the robot's physical constraints and capabilities in safe, low-cost simulation. We enable this separation by carefully designing an interface that lets the high-level planner propose candidate paths directly in image space, which the affordance model then evaluates and re-ranks. Our real-world experiments show that VAMOS achieves higher success rates in both indoor and complex outdoor navigation than state-of-the-art model-based and end-to-end learning methods. We also show that our hierarchical design enables cross-embodied navigation across legged and wheeled robots and is easily steerable using natural language. Real-world ablations confirm that the specialist model is key to embodiment grounding, enabling a single high-level planner to be deployed across physically distinct wheeled and legged robots. Finally, this model significantly enhances single-robot reliability, achieving 3X higher success rates by rejecting physically infeasible plans. Website: https://vamos-vla.github.io/
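To make the planner/affordance interface concrete, here is a minimal sketch of the propose-and-re-rank pattern the abstract describes. All names (`propose_paths`, `affordance_score`, the hazard-disk feasibility check) are illustrative assumptions, not the paper's actual API or models; the real planner is a vision-language model and the real affordance model is learned in simulation.

```python
import numpy as np

def propose_paths(num_candidates: int, path_len: int, rng) -> np.ndarray:
    """Stand-in generalist planner: sample candidate paths as sequences of
    (u, v) pixel waypoints in a 640x480 image. Returns shape (N, L, 2)."""
    start = np.array([320.0, 460.0])          # assumed image-space start point
    goals = rng.uniform([0.0, 0.0], [640.0, 480.0], size=(num_candidates, 2))
    t = np.linspace(0.0, 1.0, path_len)[None, :, None]
    return start[None, None, :] * (1.0 - t) + goals[:, None, :] * t

def affordance_score(paths: np.ndarray, hazard_center, hazard_radius) -> np.ndarray:
    """Stand-in embodiment-specific affordance model: reject any path whose
    waypoints enter a hazardous image region (e.g., stairs for a wheeled
    robot), and otherwise prefer paths with more clearance."""
    d = np.linalg.norm(paths - np.asarray(hazard_center, dtype=float), axis=-1)
    infeasible = (d < hazard_radius).any(axis=1)   # any waypoint inside hazard
    clearance = d.min(axis=1)                      # closest approach per path
    return np.where(infeasible, -np.inf, clearance)

# Propose in image space, then re-rank by physical feasibility.
rng = np.random.default_rng(0)
paths = propose_paths(num_candidates=16, path_len=20, rng=rng)
scores = affordance_score(paths, hazard_center=(320.0, 240.0), hazard_radius=60.0)
best = int(np.argmax(scores))
best_path = paths[best]
```

Because the two components only exchange image-space paths and scalar scores, the same planner can be paired with a different affordance model per robot, which is the cross-embodiment property the abstract highlights.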