LocoVLM: Grounding Vision and Language for Adapting Versatile Legged Locomotion Policies

๐Ÿ“… 2026-02-11
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work proposes an end-to-end instruction-driven framework for quadrupedal locomotion that overcomes the limitations of geometry-centric approaches, which struggle to respond to high-level semantic commands. By integrating a large language model with a vision-language model, the system interprets environmental semantics and human instructions locally and in real time, constructs a semantic skill library, and matches instructions against it to generate high-fidelity motions through a style-conditioned policy network. Notably, the approach achieves semantic-to-action mapping without relying on cloud-based foundation models, which the authors claim is a first, and reaches up to 87% instruction-following accuracy in real-world environments. The method significantly enhances the diversity, robustness, and controllability of locomotion styles in quadrupedal robots.
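The paper's code is not included here; as a rough illustration of the skill-matching step described above, the sketch below grounds a free-form instruction to a skill-library entry by embedding similarity. The `skill_library` contents, the `all-MiniLM-L6-v2` encoder, and the `ground_instruction` helper are hypothetical placeholders, not the authors' implementation (the paper synthesizes its skill database with an LLM and grounds scene semantics with a VLM).

```python
# Hypothetical sketch of instruction-to-skill grounding via embedding
# similarity. Skill names, descriptions, and the encoder choice are
# illustrative placeholders, not the paper's actual pipeline.
from sentence_transformers import SentenceTransformer
import numpy as np

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works

# Toy skill library; LocoVLM builds one with a pre-trained LLM.
skill_library = {
    "trot":      "steady medium-speed gait on flat ground",
    "crouch":    "lowered body height to pass under obstacles",
    "slow_walk": "careful low-speed gait on slippery or rough terrain",
    "bound":     "fast springy gait for open flat terrain",
}

skill_names = list(skill_library)
skill_embs = encoder.encode(
    list(skill_library.values()), normalize_embeddings=True
)

def ground_instruction(instruction: str) -> str:
    """Return the skill whose description best matches the instruction."""
    q = encoder.encode([instruction], normalize_embeddings=True)[0]
    scores = skill_embs @ q  # cosine similarity: embeddings are unit-norm
    return skill_names[int(np.argmax(scores))]

print(ground_instruction("the floor ahead looks icy, move carefully"))
# -> likely "slow_walk"
```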

๐Ÿ“ Abstract
Recent advances in legged locomotion learning are still dominated by the utilization of geometric representations of the environment, limiting the robot's capability to respond to higher-level semantics such as human instructions. To address this limitation, we propose a novel approach that integrates high-level commonsense reasoning from foundation models into the process of legged locomotion adaptation. Specifically, our method utilizes a pre-trained large language model to synthesize an instruction-grounded skill database tailored for legged robots. A pre-trained vision-language model is employed to extract high-level environmental semantics and ground them within the skill database, enabling real-time skill advisories for the robot. To facilitate versatile skill control, we train a style-conditioned policy capable of generating diverse and robust locomotion skills with high fidelity to specified styles. To the best of our knowledge, this is the first work to demonstrate real-time adaptation of legged locomotion using high-level reasoning from environmental semantics and instructions, with instruction-following accuracy of up to 87% and without the need for online queries to cloud-based foundation models.
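As a sketch of what the style-conditioned policy mentioned in the abstract can look like, the snippet below conditions the action head on a style vector by simple concatenation with the proprioceptive observation. The layer sizes, the 12-actuator action space, and the concatenation scheme are assumptions for illustration; the paper's actual architecture and training setup may differ.

```python
# Minimal PyTorch sketch of a style-conditioned locomotion policy:
# the action depends on both the observation and a commanded style
# vector. Dimensions and the conditioning scheme are assumptions.
import torch
import torch.nn as nn

class StyleConditionedPolicy(nn.Module):
    def __init__(self, obs_dim: int = 48, style_dim: int = 8, act_dim: int = 12):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + style_dim, 256), nn.ELU(),
            nn.Linear(256, 128), nn.ELU(),
            nn.Linear(128, act_dim),  # e.g. joint targets for 12 actuators
        )

    def forward(self, obs: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # Conditioning by concatenation: the same observation yields a
        # different gait depending on the commanded style vector.
        return self.net(torch.cat([obs, style], dim=-1))

policy = StyleConditionedPolicy()
obs = torch.randn(1, 48)                       # proprioceptive state
style = torch.zeros(1, 8); style[0, 2] = 1.0   # e.g. one-hot "crouch" style
action = policy(obs, style)                    # shape (1, 12)
```

A one-hot style vector is the simplest choice; a learned style embedding produced by the skill-matching stage would slot into the same interface.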
Problem

Research questions and friction points this paper is trying to address.

legged locomotion
semantic understanding
human instruction
vision-language grounding
adaptive policy
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language grounding
legged locomotion adaptation
foundation models
instruction-following robotics
style-conditioned policy
๐Ÿ”Ž Similar Papers