NIL: No-data Imitation Learning by Leveraging Pre-trained Video Diffusion Models

📅 2025-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of acquiring high-quality expert demonstrations for non-humanoid robots (e.g., quadrupeds, insect-like agents) and the heavy reliance of reinforcement learning on hand-crafted reward engineering, this paper introduces a novel "data-free imitation learning" paradigm. It bypasses real-world motion capture data entirely, instead distilling physically plausible 3D locomotion skills directly from synthetic 2D videos generated by pre-trained video diffusion models. Methodologically, the authors propose a dual-path vision-guided reward, comprising video embedding distance and frame-wise semantic segmentation similarity, integrated with a Vision Transformer-based video encoder, kinematically constrained 3D motion optimization, and a cross-modal alignment loss. Experiments demonstrate that the approach surpasses motion-capture-based baselines across diverse robot morphologies, producing high-fidelity, dynamics-feasible control policies solely from synthetic video. Notably, it generalizes successfully to unconventional morphologies such as ant-inspired robots.
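The dual-path reward described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the IoU-based segmentation similarity, and the weighting factor `alpha` are all illustrative assumptions; the summary only states that the reward combines a video-embedding distance with a frame-wise segmentation similarity.

```python
import numpy as np

def embedding_reward(policy_emb: np.ndarray, target_emb: np.ndarray) -> float:
    """Cosine similarity between video embeddings (e.g. from a ViT video
    encoder); higher means the policy rollout looks more like the
    diffusion-generated reference video. Illustrative, not the paper's exact term."""
    cos = np.dot(policy_emb, target_emb) / (
        np.linalg.norm(policy_emb) * np.linalg.norm(target_emb)
    )
    return float(cos)

def segmentation_reward(policy_mask: np.ndarray, target_mask: np.ndarray) -> float:
    """Frame-wise similarity between binary segmentation masks of the agent,
    sketched here as intersection-over-union (an assumed choice of metric)."""
    inter = np.logical_and(policy_mask, target_mask).sum()
    union = np.logical_or(policy_mask, target_mask).sum()
    return float(inter / union) if union > 0 else 0.0

def dual_path_reward(policy_emb, target_emb, policy_mask, target_mask,
                     alpha: float = 0.5) -> float:
    """Weighted combination of the two vision-guided reward terms.
    The weighting scheme `alpha` is a hypothetical parameter for this sketch."""
    return (alpha * embedding_reward(policy_emb, target_emb)
            + (1.0 - alpha) * segmentation_reward(policy_mask, target_mask))
```

In this sketch a perfect match on both paths yields a reward of 1.0; in practice the embeddings would come from encoding rendered policy rollouts and generated reference videos with the same pre-trained vision transformer.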

📝 Abstract
Acquiring physically plausible motor skills across diverse and unconventional morphologies, including humanoid robots, quadrupeds, and animals, is essential for advancing character simulation and robotics. Traditional methods, such as reinforcement learning (RL), are task- and body-specific, require extensive reward function engineering, and do not generalize well. Imitation learning offers an alternative but relies heavily on high-quality expert demonstrations, which are difficult to obtain for non-human morphologies. Video diffusion models, on the other hand, are capable of generating realistic videos of various morphologies, from humans to ants. Leveraging this capability, we propose a data-independent approach for skill acquisition that learns 3D motor skills from 2D-generated videos, with generalization capability to unconventional and non-human forms. Specifically, we guide the imitation learning process by leveraging vision transformers for video-based comparisons, calculating the pair-wise distance between video embeddings. Along with the video-encoding distance, we also use a computed similarity between segmented video frames as a guidance reward. We validate our method on locomotion tasks involving unique body configurations. On humanoid robot locomotion tasks, we demonstrate that 'No-data Imitation Learning' (NIL) outperforms baselines trained on 3D motion-capture data. Our results highlight the potential of leveraging generative video models for physically plausible skill learning with diverse morphologies, effectively replacing data collection with data generation for imitation learning.
Problem

Research questions and friction points this paper is trying to address.

Acquiring motor skills for diverse and unconventional morphologies
Overcoming reliance on high-quality expert demonstrations
Generalizing skill learning to non-human forms using video diffusion models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages pre-trained video diffusion models
Uses vision transformers for video comparisons
Replaces data collection with data generation