Cross-Modal Instructions for Robot Motion Generation

📅 2025-09-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Traditional robot behavior learning relies on teleoperation or physical demonstration, incurring high data-collection costs and poor generalization. This paper proposes a cross-modal instruction learning framework that, for the first time, replaces physical motion demonstrations with coarse-grained textual descriptions and hand-drawn sketches to enable demonstration-free behavioral shaping. The method integrates a foundational vision-language model (VLM) with a fine-grained pointing model to geometrically synthesize continuous 3D trajectories from multi-view 2D observations. These trajectories are optimized jointly via 3D trajectory distribution fusion and downstream reinforcement learning (RL). The framework achieves few-shot cross-environment generalization and is validated in simulation and on real hardware: it generates executable actions without fine-tuning and provides high-quality policy initialization for RL, significantly improving learning efficiency on dexterous manipulation tasks.

📝 Abstract
Teaching robots novel behaviors typically requires motion demonstrations via teleoperation or kinaesthetic teaching, that is, physically guiding the robot. While recent work has explored using human sketches to specify desired behaviors, data collection remains cumbersome, and demonstration datasets are difficult to scale. In this paper, we introduce an alternative paradigm, Learning from Cross-Modal Instructions, where robots are shaped by demonstrations in the form of rough annotations, which can contain free-form text labels, and are used in lieu of physical motion. We introduce the CrossInstruct framework, which integrates cross-modal instructions as examples into the context input to a foundational vision-language model (VLM). The VLM then iteratively queries a smaller, fine-tuned model and synthesizes the desired motion over multiple 2D views, which are subsequently fused into a coherent distribution over 3D motion trajectories in the robot's workspace. By incorporating the reasoning of the large VLM with a fine-grained pointing model, CrossInstruct produces executable robot behaviors that generalize beyond the environments in the limited set of instruction examples. We then introduce a downstream reinforcement learning pipeline that leverages CrossInstruct outputs to efficiently learn policies to complete fine-grained tasks. We rigorously evaluate CrossInstruct on benchmark simulation tasks and real hardware, demonstrating effectiveness without additional fine-tuning and providing a strong initialization for policies subsequently refined via reinforcement learning.
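The abstract describes fusing per-view 2D motion into a 3D trajectory in the robot's workspace. The paper does not publish its fusion code here, but the geometric core of such a step can be sketched with classical linear (DLT) triangulation across calibrated views. This is a minimal illustration under the assumption of known camera projection matrices and time-aligned per-view waypoints; function names (`triangulate_point`, `fuse_trajectory`) are hypothetical, not the authors' API.

```python
import numpy as np

def triangulate_point(points_2d, proj_mats):
    """Linear DLT triangulation: recover one 3D point from its 2D
    projections in several calibrated views. Each projection matrix
    P is 3x4; each 2D point is (u, v) in that view's image plane."""
    A = []
    for (u, v), P in zip(points_2d, proj_mats):
        # Each view contributes two linear constraints on the
        # homogeneous 3D point X: u*(P[2]·X) = P[0]·X, and similarly for v.
        A.append(u * P[2] - P[0])
        A.append(v * P[2] - P[1])
    # The homogeneous solution is the right singular vector with the
    # smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(A))
    X = vt[-1]
    return X[:3] / X[3]  # dehomogenize

def fuse_trajectory(views):
    """Fuse time-aligned per-view 2D trajectories into one 3D trajectory.
    `views` is a list of (trajectory_2d, projection_matrix) pairs, where
    every trajectory_2d has the same number of waypoints."""
    n_steps = len(views[0][0])
    traj_3d = []
    for t in range(n_steps):
        pts = [traj[t] for traj, _ in views]
        Ps = [P for _, P in views]
        traj_3d.append(triangulate_point(pts, Ps))
    return np.array(traj_3d)
```

In the paper's setting the per-view motions come from VLM-guided pointing rather than feature matching, and the output is a distribution over 3D trajectories rather than a single curve; a probabilistic variant would weight or sample these triangulated points accordingly.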
Problem

Research questions and friction points this paper is trying to address.

Teaching robots novel behaviors without physical motion demonstrations
Generating executable robot motion from cross-modal instructions such as sketches and free-form text annotations
Creating scalable robot behavior learning that generalizes beyond limited examples
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses cross-modal instructions instead of physical demonstrations
Integrates instructions into a vision-language model's context
Synthesizes 3D motion from multiple 2D views via fusion
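The innovations above describe a loop in which a large VLM iteratively queries a smaller pointing model per 2D view before fusion. A minimal sketch of that control flow is below; `query_vlm` and `query_pointer` are hypothetical stand-ins for the coarse reasoner and the fine-grained pointing model (the paper's actual interfaces are not specified on this page), passed in as callables so the skeleton stays model-agnostic.

```python
def synthesize_trajectory(query_vlm, query_pointer, views, n_waypoints=5):
    """Sketch of the iterative synthesis loop: for each 2D view, the
    coarse reasoner (query_vlm) proposes where the next waypoint should
    go given the waypoints so far, and the fine-grained pointer
    (query_pointer) localizes a precise 2D point for that proposal.

    Returns one 2D waypoint list per view; a downstream fusion step
    (e.g., multi-view triangulation) would lift these into a
    distribution over 3D trajectories.
    """
    per_view = []
    for view in views:
        waypoints = []
        for _ in range(n_waypoints):
            region = query_vlm(view, waypoints)       # coarse spatial reasoning
            waypoints.append(query_pointer(view, region))  # fine localization
        per_view.append(waypoints)
    return per_view
```

The fixed waypoint count is a simplification; an actual system would let the VLM decide when the motion is complete.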
William Barron
College of Connected Computing, Vanderbilt University, TN, USA
Xiaoxiang Dong
College of Connected Computing, Vanderbilt University, TN, USA
Matthew Johnson-Roberson
Professor of Robotics, Carnegie Mellon University
Robotics, Field Robotics, Autonomous Vehicles, Marine Robotics
Weiming Zhi
School of Computer Science, The University of Sydney, Australia