LocoVLM: Grounding Vision and Language for Adapting Versatile Legged Locomotion Policies

๐Ÿ“… 2026-02-11
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work proposes an end-to-end instruction-driven framework for quadrupedal locomotion that overcomes the limitations of geometry-centric approaches, which struggle to respond to high-level semantic commands. By integrating a large language model with a vision-language model, the system interprets environmental semantics and human instructions locally and in real time, constructs a semantic skill library, and matches instructions against it to generate high-fidelity motions through a style-conditioned policy network. Notably, the approach achieves semantic-to-action mapping without relying on cloud-based foundation models, which the authors claim is a first, and reaches up to 87% instruction-following accuracy in real-world environments. The method significantly enhances the diversity, robustness, and controllability of locomotion styles in quadrupedal robots.
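The paper's code is not included here; as a rough illustration of the skill-matching step described above, the sketch below grounds a free-form instruction to a skill-library entry by embedding similarity. The `skill_library` contents, the `all-MiniLM-L6-v2` encoder, and the `ground_instruction` helper are hypothetical placeholders, not the authors' implementation (the paper synthesizes its skill database with an LLM and grounds scene semantics with a VLM).

```python
# Hypothetical sketch of instruction-to-skill grounding via embedding
# similarity. Skill names, descriptions, and the encoder choice are
# illustrative placeholders, not the paper's actual pipeline.
from sentence_transformers import SentenceTransformer
import numpy as np

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works

# Toy skill library; LocoVLM builds one with a pre-trained LLM.
skill_library = {
    "trot":      "steady medium-speed gait on flat ground",
    "crouch":    "lowered body height to pass under obstacles",
    "slow_walk": "careful low-speed gait on slippery or rough terrain",
    "bound":     "fast springy gait for open flat terrain",
}

skill_names = list(skill_library)
skill_embs = encoder.encode(
    list(skill_library.values()), normalize_embeddings=True
)

def ground_instruction(instruction: str) -> str:
    """Return the skill whose description best matches the instruction."""
    q = encoder.encode([instruction], normalize_embeddings=True)[0]
    scores = skill_embs @ q  # cosine similarity: embeddings are unit-norm
    return skill_names[int(np.argmax(scores))]

print(ground_instruction("the floor ahead looks icy, move carefully"))
# -> likely "slow_walk"
```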

๐Ÿ“ Abstract
Recent advances in legged locomotion learning are still dominated by the utilization of geometric representations of the environment, limiting the robot's capability to respond to higher-level semantics such as human instructions. To address this limitation, we propose a novel approach that integrates high-level commonsense reasoning from foundation models into the process of legged locomotion adaptation. Specifically, our method utilizes a pre-trained large language model to synthesize an instruction-grounded skill database tailored for legged robots. A pre-trained vision-language model is employed to extract high-level environmental semantics and ground them within the skill database, enabling real-time skill advisories for the robot. To facilitate versatile skill control, we train a style-conditioned policy capable of generating diverse and robust locomotion skills with high fidelity to specified styles. To the best of our knowledge, this is the first work to demonstrate real-time adaptation of legged locomotion using high-level reasoning from environmental semantics and instructions, with instruction-following accuracy of up to 87% and without the need for online queries to cloud-based foundation models.
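As a sketch of what the style-conditioned policy mentioned in the abstract can look like, the snippet below conditions the action head on a style vector by simple concatenation with the proprioceptive observation. The layer sizes, the 12-actuator action space, and the concatenation scheme are assumptions for illustration; the paper's actual architecture and training setup may differ.

```python
# Minimal PyTorch sketch of a style-conditioned locomotion policy:
# the action depends on both the observation and a commanded style
# vector. Dimensions and the conditioning scheme are assumptions.
import torch
import torch.nn as nn

class StyleConditionedPolicy(nn.Module):
    def __init__(self, obs_dim: int = 48, style_dim: int = 8, act_dim: int = 12):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + style_dim, 256), nn.ELU(),
            nn.Linear(256, 128), nn.ELU(),
            nn.Linear(128, act_dim),  # e.g. joint targets for 12 actuators
        )

    def forward(self, obs: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # Conditioning by concatenation: the same observation yields a
        # different gait depending on the commanded style vector.
        return self.net(torch.cat([obs, style], dim=-1))

policy = StyleConditionedPolicy()
obs = torch.randn(1, 48)                       # proprioceptive state
style = torch.zeros(1, 8); style[0, 2] = 1.0   # e.g. one-hot "crouch" style
action = policy(obs, style)                    # shape (1, 12)
```

A one-hot style vector is the simplest choice; a learned style embedding produced by the skill-matching stage would slot into the same interface.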
Problem

Research questions and friction points this paper is trying to address.

legged locomotion
semantic understanding
human instruction
vision-language grounding
adaptive policy
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language grounding
legged locomotion adaptation
foundation models
instruction-following robotics
style-conditioned policy
๐Ÿ”Ž Similar Papers