Train a Multi-Task Diffusion Policy on RLBench-18 in One Day with One GPU

📅 2025-05-14

🤖 AI Summary

Training diffusion-based vision-language robot policies for multi-task settings is costly in both time and memory. This paper proposes Mini-Diffuser to reduce that cost enough for single-GPU training. Its core idea is Level-2 minibatching, a batching mechanism that pairs multiple noised action samples with each high-dimensional vision-language condition, exploiting the asymmetry between low-dimensional actions and high-dimensional conditions. It also adapts the diffusion Transformer architecture to prevent information leakage across samples, so batched training stays efficient while every sample keeps full access to the conditioning. On RLBench-18, Mini-Diffuser reaches 95% of the performance of state-of-the-art multi-task diffusion policies after one day of training on a single GPU, using 5% of the training time and 7% of the memory of prior methods. Real-world experiments indicate that it retains the ability to model multimodal action distributions and to produce behavior conditioned on diverse perceptual inputs.

📝 Abstract

We present a method for training multi-task vision-language robotic diffusion policies that reduces training time and memory usage by an order of magnitude. This improvement arises from a previously underexplored distinction between action diffusion and the image diffusion techniques that inspired it: image generation targets are high-dimensional, while robot actions lie in a much lower-dimensional space. Meanwhile, the vision-language conditions for action generation remain high-dimensional. Our approach, Mini-Diffuser, exploits this asymmetry by introducing Level-2 minibatching, which pairs multiple noised action samples with each vision-language condition, instead of the conventional one-to-one sampling strategy. To support this batching scheme, we introduce architectural adaptations to the diffusion transformer that prevent information leakage across samples while maintaining full conditioning access. In RLBench simulations, Mini-Diffuser achieves 95% of the performance of state-of-the-art multi-task diffusion policies, while using only 5% of the training time and 7% of the memory. Real-world experiments further validate that Mini-Diffuser preserves the key strengths of diffusion-based policies, including the ability to model multimodal action distributions and produce behavior conditioned on diverse perceptual inputs. Code available at github.com/utomm/mini-diffuse-actor.
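The pairing of several noised action samples with one condition can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the linear beta schedule, the number of diffusion steps, and the choice of `k` samples per condition are all assumptions made for illustration, and the standard DDPM forward-noising formula is used as a stand-in for whatever noise process the paper adopts.

```python
import numpy as np


def make_alpha_bar(num_steps=100):
    """Cumulative signal-retention coefficients for an assumed linear beta schedule."""
    betas = np.linspace(1e-4, 0.02, num_steps)
    return np.cumprod(1.0 - betas)


def level2_minibatch(actions, alpha_bar, k=8, rng=None):
    """Draw k noised copies of each clean action (hypothetical helper).

    actions:   (B, D) array, one clean action per vision-language condition.
    alpha_bar: (T,) array of cumulative products of (1 - beta).
    Returns noised actions (B, k, D), the noise (B, k, D), and timesteps (B, k).
    """
    rng = np.random.default_rng() if rng is None else rng
    b, d = actions.shape
    # Independent timestep and noise for every (condition, sample) pair.
    t = rng.integers(0, len(alpha_bar), size=(b, k))
    eps = rng.standard_normal((b, k, d))
    signal = np.sqrt(alpha_bar[t])[..., None]
    sigma = np.sqrt(1.0 - alpha_bar[t])[..., None]
    # Broadcast each clean action across its k noised samples.
    noised = signal * actions[:, None, :] + sigma * eps
    return noised, eps, t


if __name__ == "__main__":
    clean = np.zeros((4, 7))  # 4 conditions, 7-dimensional actions (illustrative sizes)
    noised, eps, t = level2_minibatch(clean, make_alpha_bar(), k=8)
    print(noised.shape, eps.shape, t.shape)
```

In this sketch the expensive vision-language condition would be encoded once per row of `actions`, while the cheap action-noising step is repeated `k` times against it; the batch of conditions is the first level and the `k` samples per condition form the second.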

Problem

Research questions and friction points this paper is trying to address.

Reduce training time for multi-task robotic diffusion policies

Minimize memory usage in vision-language action generation

Maintain performance while optimizing computational resources

Innovation

Methods, ideas, or system contributions that make the work stand out.

Level-2 minibatching pairs multiple actions per condition

Architectural adaptations prevent information leakage

Cuts training time to 5% and memory usage to 7% of prior methods while retaining 95% of state-of-the-art performance
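One way to picture leakage prevention is an attention mask over a sequence of shared condition tokens followed by several noised action tokens. The sketch below is an assumption about how such a mask could look, not the architecture described in the paper: `build_leakage_mask` and `masked_attention` are hypothetical helpers, and the rule used here (every token attends to all condition tokens, each action token additionally attends only to itself) is just one scheme that satisfies "no leakage across samples, full conditioning access".

```python
import numpy as np


def build_leakage_mask(num_cond, num_act):
    """Boolean mask of shape (N, N), N = num_cond + num_act; True means 'may attend'."""
    n = num_cond + num_act
    mask = np.zeros((n, n), dtype=bool)
    # Every query may attend to all shared condition tokens.
    mask[:, :num_cond] = True
    # Each action token may also attend to itself, but not to other action tokens.
    idx = np.arange(num_cond, n)
    mask[idx, idx] = True
    return mask


def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention with a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    # Subtract the row maximum for numerical stability before exponentiating.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.standard_normal((3 + 4, 16))  # 3 condition tokens, 4 action samples
    mask = build_leakage_mask(3, 4)
    out = masked_attention(tokens, tokens, tokens, mask)

    # Perturb the last action sample and check that no other output changes.
    perturbed = tokens.copy()
    perturbed[-1] += 10.0
    out_perturbed = masked_attention(perturbed, perturbed, perturbed, mask)
    print(np.allclose(out[:-1], out_perturbed[:-1]))
```

Under this mask the condition tokens never read from action tokens, so their representation can be computed once and shared, and changing one noised sample cannot affect the prediction for another.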
