🤖 AI Summary
High acquisition cost and limited diversity of real-world robot manipulation data, particularly in object appearance and environmental configuration, hinder the generalization of vision-language-action (VLA) models. To address this, we propose Embodied Manipulation Media Adaptation (EMMA), a VLA policy enhancement framework built on two components: DreamTransfer, a diffusion Transformer-based engine that generates geometrically plausible, multi-view-consistent, and text-controllable embodied manipulation videos, enabling generative-data-driven zero-shot visual domain adaptation; and AdaMix, a dynamic sample-reweighting strategy that adaptively emphasizes hard examples during training. Our approach integrates generative visual transfer, hybrid real-synthetic data training, and adaptive sample weighting. Experiments on zero-shot cross-domain manipulation tasks demonstrate that training with DreamTransfer-generated data improves performance by over 200% relative to a real-data-only baseline, and incorporating AdaMix yields an additional 13% gain, significantly enhancing policy robustness and generalization across unseen domains and objects.
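To make the hybrid real-synthetic training concrete, here is a minimal sketch of batch-level data mixing in PyTorch. The summary does not specify the mixing scheme, so the weighted-sampler approach, the `synth_ratio` parameter, and the dataset handles below are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of hybrid real + DreamTransfer-generated data training.
# Assumption: the exact mixing scheme is not given in the summary; the
# weighted sampler and the 50/50 default ratio here are illustrative.
import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

def make_hybrid_loader(real_ds, synth_ds, batch_size=32, synth_ratio=0.5):
    """Draw each batch from real and generated demos at a fixed expected ratio."""
    hybrid = ConcatDataset([real_ds, synth_ds])
    # Per-sample weights so generated data accounts for `synth_ratio`
    # of the draws regardless of the raw dataset sizes.
    w_real = (1.0 - synth_ratio) / len(real_ds)
    w_synth = synth_ratio / len(synth_ds)
    weights = torch.tensor([w_real] * len(real_ds) + [w_synth] * len(synth_ds))
    sampler = WeightedRandomSampler(weights, num_samples=len(hybrid), replacement=True)
    return DataLoader(hybrid, batch_size=batch_size, sampler=sampler)
```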
📝 Abstract
Vision-language-action (VLA) models increasingly rely on diverse training data to achieve robust generalization. However, collecting large-scale real-world robot manipulation data across varied object appearances and environmental conditions remains prohibitively time-consuming and expensive. To overcome this bottleneck, we propose Embodied Manipulation Media Adaptation (EMMA), a VLA policy enhancement framework that couples a generative data engine with an effective training pipeline. We introduce DreamTransfer, a diffusion Transformer-based framework for generating multi-view consistent, geometrically grounded embodied manipulation videos. DreamTransfer enables text-controlled visual editing of robot videos, transforming foreground, background, and lighting conditions without compromising 3D structure or geometric plausibility. Furthermore, we explore hybrid training on real and generated data, and introduce AdaMix, a hard-sample-aware training strategy that dynamically reweights training batches to focus optimization on perceptually or kinematically challenging samples. Extensive experiments show that videos generated by DreamTransfer significantly outperform prior video generation methods in multi-view consistency, geometric fidelity, and text-conditioning accuracy. Crucially, VLAs trained with the generated data can generalize to unseen object categories and novel visual domains using demonstrations collected under only a single visual appearance. On real-world robotic manipulation tasks in zero-shot visual domains, our approach achieves over a 200% relative performance gain compared to training on real data alone, and AdaMix contributes a further 13% improvement, demonstrating its effectiveness in boosting policy generalization.
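As an illustration of the dynamic reweighting idea behind AdaMix, the sketch below combines a batch of per-sample losses so that harder samples dominate the gradient update. The abstract does not give AdaMix's actual weighting rule; the softmax-over-loss scheme and the temperature `tau` are assumptions, chosen as one common hard-sample-mining formulation.

```python
# Minimal sketch of loss-based hard-sample reweighting in the spirit of
# AdaMix. Assumption: the paper's exact rule is not stated here; this
# softmax-over-detached-loss scheme is one standard way to upweight
# high-loss ("hard") samples.
import torch

def hard_sample_weighted_loss(per_sample_loss: torch.Tensor,
                              tau: float = 1.0) -> torch.Tensor:
    """Combine a batch of per-sample losses with hardness-based weights.

    `per_sample_loss` is a 1-D tensor (one loss per sample); `tau` is a
    temperature: smaller values concentrate weight on the hardest samples.
    """
    # Detach before weighting so the weights act as fixed importance
    # factors rather than an extra gradient path.
    weights = torch.softmax(per_sample_loss.detach() / tau, dim=0)
    return (weights * per_sample_loss).sum()
```

Detaching the losses before the softmax keeps the weighting from adding a second gradient path, so the update is simply a hardness-weighted average of the ordinary per-sample gradients.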