Chain-of-Modality: Learning Manipulation Programs from Multimodal Human Videos with Vision-Language-Models

📅 2025-04-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that purely vision-based methods struggle to infer critical control parameters, such as force, that robotic manipulation requires. To overcome this limitation, the authors learn executable task plans and dynamic control parameters directly from multimodal human demonstration videos that combine RGB video, electromyography (EMG), and audio. The key contribution is the Chain-of-Modality (CoM) progressive prompting strategy, which enables vision-language models to reason jointly over video, EMG, and audio: evidence from each modality is injected in stages, and the task plan is refined at each stage before the final control parameters are generated. Experiments show a threefold improvement in task-plan and control-parameter extraction accuracy over baseline approaches, and the method generalizes to new objects and task setups on real robots, outperforming both unimodal and conventional early-fusion multimodal baselines.

📝 Abstract
Learning to perform manipulation tasks from human videos is a promising approach for teaching robots. However, many manipulation tasks require changing control parameters during task execution, such as force, which visual data alone cannot capture. In this work, we leverage sensing devices such as armbands that measure human muscle activities and microphones that record sound, to capture the details in the human manipulation process, and enable robots to extract task plans and control parameters to perform the same task. To achieve this, we introduce Chain-of-Modality (CoM), a prompting strategy that enables Vision Language Models to reason about multimodal human demonstration data -- videos coupled with muscle or audio signals. By progressively integrating information from each modality, CoM refines a task plan and generates detailed control parameters, enabling robots to perform manipulation tasks based on a single multimodal human video prompt. Our experiments show that CoM delivers a threefold improvement in accuracy for extracting task plans and control parameters compared to baselines, with strong generalization to new task setups and objects in real-world robot experiments. Videos and code are available at https://chain-of-modality.github.io
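For readers who want the mechanics, below is a minimal sketch of how such staged prompting could be wired up. The `Demonstration` fields, prompt wording, and the `query_vlm` stub are illustrative assumptions, not the authors' implementation; the real system prompts a Vision Language Model with video frames and per-step muscle and audio evidence.

```python
# Minimal sketch of Chain-of-Modality-style progressive prompting.
# All names, prompts, and the query_vlm stub are hypothetical placeholders.

from dataclasses import dataclass
from typing import List


@dataclass
class Demonstration:
    """One multimodal human demo, already segmented and summarized per modality."""
    video_frames: List[str]   # e.g., paths or encoded key frames
    emg_summary: str          # summary of muscle (EMG) activity per step
    audio_summary: str        # summary of recorded sound per step


def query_vlm(messages: List[dict]) -> str:
    """Placeholder for a real chat-style VLM API call (assumption)."""
    return "<model response>"


def chain_of_modality(demo: Demonstration) -> str:
    """Query the VLM stage by stage, adding one modality at a time."""
    messages = [{"role": "system",
                 "content": "You extract robot task plans and control parameters."}]

    # Stage 1: vision only -> coarse task plan.
    messages.append({"role": "user",
                     "content": f"Video key frames: {demo.video_frames}. "
                                "Describe the manipulation steps."})
    plan = query_vlm(messages)
    messages.append({"role": "assistant", "content": plan})

    # Stage 2: add muscle signals -> refine force-related parameters.
    messages.append({"role": "user",
                     "content": f"Muscle (EMG) activity per step: {demo.emg_summary}. "
                                "Refine the plan with a gripper force for each step."})
    plan = query_vlm(messages)
    messages.append({"role": "assistant", "content": plan})

    # Stage 3: add audio -> refine contact/timing cues and finalize parameters.
    messages.append({"role": "user",
                     "content": f"Audio events per step: {demo.audio_summary}. "
                                "Output the final task plan with control parameters."})
    return query_vlm(messages)
```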
Problem

Research questions and friction points this paper is trying to address.

Extracting task plans from multimodal human videos
Capturing control parameters such as force with additional sensors (an illustrative output sketch follows this list)
Enabling robots to perform tasks from multimodal human demonstrations
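As an illustration of what "task plan plus control parameters" might look like downstream, here is a hedged sketch of a structured plan a robot executor could consume; the action names, fields, and numeric values are hypothetical and not taken from the paper.

```python
# Hypothetical example of the kind of structured output such a pipeline could
# produce: a task plan whose steps carry control parameters inferred from EMG
# and audio. Step names, fields, and values are illustrative assumptions only.
task_plan = [
    {"action": "reach", "target": "object", "speed": 0.2},         # m/s (assumed)
    {"action": "grasp", "target": "object", "grip_force": 8.0},    # N, cue from EMG
    {"action": "place", "target": "bin", "grip_force": 2.0,
     "release_on": "contact_sound"},                                # cue from audio
]

for step in task_plan:
    print(step)  # a real executor would dispatch each step to the robot controller
```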
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging multimodal data for robot learning
Chain-of-Modality prompting for VLMs
Progressively integrating vision, muscle, and audio signals