Embodied Learning of Reward for Musculoskeletal Control with Vision Language Models

📅 2025-12-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Manually designing reward functions for high-dimensional musculoskeletal motor control, especially from natural-language goals (e.g., "walk forward upright"), is notoriously difficult due to the semantic gap between abstract instructions and low-level biomechanical dynamics. Method: We propose MoVLR, the first framework to integrate vision-language models (VLMs) into embodied reward learning. MoVLR aligns linguistic goal descriptions with physical motion representations via VLMs, then jointly optimizes task-specific rewards through closed-loop iteration of reinforcement learning and biomechanical dynamics modeling. Contribution/Results: MoVLR eliminates reliance on expert domain knowledge, enabling end-to-end mapping from semantic goals to implicit physiological coordination policies. Evaluated across diverse locomotion and manipulation tasks, the learned rewards significantly improve motion coordination, goal fidelity, and cross-task generalization, demonstrating the efficacy of VLMs as principled anchors for physiological movement priors.

📝 Abstract
Discovering effective reward functions remains a fundamental challenge in motor control of high-dimensional musculoskeletal systems. While humans can describe movement goals explicitly, such as "walking forward with an upright posture," the underlying control strategies that realize these goals are largely implicit, making it difficult to design rewards directly from high-level goals and natural-language descriptions. We introduce Motion from Vision-Language Representation (MoVLR), a framework that leverages vision-language models (VLMs) to bridge the gap between goal specification and movement control. Rather than relying on handcrafted rewards, MoVLR explores the reward space through iterative interaction between control optimization and VLM feedback, aligning control policies with physically coordinated behaviors. Our approach transforms language- and vision-based assessments into structured guidance for embodied learning, enabling the discovery and refinement of reward functions for high-dimensional musculoskeletal locomotion and manipulation. Our results suggest that VLMs can effectively ground abstract motion descriptions in the implicit principles governing physiological motor control.
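The closed-loop structure described in the abstract can be pictured in a few lines of code. The sketch below is a hypothetical illustration under stated assumptions, not the paper's implementation: the function names (`propose_reward`, `train_policy`, `vlm_assess`), the feature set, and the stand-in bodies are all invented for this example; a real system would call an actual VLM and run full RL training in their place.

```python
# Hypothetical sketch of a MoVLR-style closed-loop reward discovery
# iteration. All function bodies are illustrative stand-ins.

import random

# Assumed biomechanical reward features; the paper's actual terms are unknown.
FEATURES = ["forward_velocity", "torso_uprightness", "energy_cost"]

def propose_reward(goal: str, feedback: str | None) -> dict[str, float]:
    """Stand-in for a VLM call that maps a language goal (plus prior
    critique, if any) to weights over biomechanical reward features."""
    return {f: random.uniform(-1.0, 1.0) for f in FEATURES}

def train_policy(weights: dict[str, float]) -> dict[str, float]:
    """Stand-in for an RL run on the musculoskeletal model; returns
    summary statistics of the resulting rollout."""
    return {f: random.random() for f in FEATURES}

def vlm_assess(goal: str, rollout_stats: dict[str, float]) -> tuple[float, str]:
    """Stand-in for VLM scoring of a rendered rollout against the goal."""
    score = sum(rollout_stats.values()) / len(rollout_stats)
    return score, "increase torso_uprightness weight"

def movlr_loop(goal: str, iterations: int = 5) -> dict[str, float]:
    feedback = None
    best_weights, best_score = {}, float("-inf")
    for _ in range(iterations):
        weights = propose_reward(goal, feedback)   # VLM: language -> reward
        stats = train_policy(weights)              # RL: reward -> behavior
        score, feedback = vlm_assess(goal, stats)  # VLM: behavior -> critique
        if score > best_score:
            best_score, best_weights = score, weights
    return best_weights

if __name__ == "__main__":
    print(movlr_loop("walk forward with an upright posture"))
```

The key design point from the abstract is the alternation: control optimization produces behavior, and VLM feedback on that behavior reshapes the reward before the next round, rather than the reward being fixed up front by hand.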
Problem

Research questions and friction points this paper is trying to address.

Discovering effective reward functions for high-dimensional musculoskeletal systems.
Bridging the gap between goal specification and movement control.
Transforming language and vision assessments into structured guidance for embodied learning.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging vision-language models for reward function discovery.
Iterative exploration of the reward space via VLM feedback.
Transforming language and vision assessments into structured guidance (a minimal sketch of this step follows below).
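As a concrete, and entirely hypothetical, example of "structured guidance," a VLM critique could be requested as a per-criterion JSON rubric and folded back into reward weights. The schema, criterion names, and update rule below are assumptions for illustration only; the paper does not specify this format.

```python
# Hypothetical example of turning a VLM assessment into structured
# reward guidance. The JSON schema and update rule are assumed.

import json

# Imagined VLM reply: per-criterion scores in [0, 1] for one rollout.
vlm_reply = '{"uprightness": 0.4, "forward_progress": 0.9, "smoothness": 0.6}'

def update_weights(weights: dict[str, float], reply: str,
                   lr: float = 0.5) -> dict[str, float]:
    """Upweight reward terms the VLM scored poorly, so the next RL
    round focuses on the weakest aspect of the behavior."""
    scores = json.loads(reply)
    return {k: w + lr * (1.0 - scores.get(k, 1.0)) for k, w in weights.items()}

weights = {"uprightness": 1.0, "forward_progress": 1.0, "smoothness": 1.0}
print(update_weights(weights, vlm_reply))
# -> the uprightness weight grows the most, since it scored lowest
```

Requesting the critique as machine-parseable scores, rather than free text, is one plausible way to make VLM feedback usable as a reward signal inside the RL loop.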