AI Summary
To address the challenges of modeling heterogeneous multimodal data and weak policy generalization in lightweight cross-embodiment robotic manipulation, this paper proposes Tenma, a lightweight Diffusion Transformer architecture. The method introduces three key innovations: (1) a cross-embodiment normalizer that unifies heterogeneous embodiment representations; (2) a Joint State-Time encoder that fuses multi-view RGB, proprioceptive, and language inputs; and (3) an optimized diffusion-based action decoder enabling high-precision, low-latency bimanual control. Under matched computational budgets, the approach achieves an in-distribution average success rate of 88.95%, a +70.83 percentage-point improvement over baseline methods, while maintaining strong robustness to object and scene variations. This significantly enhances cross-embodiment policy transferability and real-time inference efficiency.
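The diffusion-based action decoder can be pictured as iterative denoising of an action chunk conditioned on the fused observation embedding. The sketch below is a minimal DDPM-style sampling loop, assuming a generic noise-prediction model; the function names, noise schedule, horizon, and action dimensionality are illustrative assumptions, not Tenma's exact decoder.

```python
# Minimal sketch (assumed, not the paper's exact decoder) of diffusion-based
# action decoding: starting from Gaussian noise, an action chunk is iteratively
# denoised by a model conditioned on the fused observation embedding.
import torch

def denoise_action_chunk(model, obs_emb, horizon=16, action_dim=14, num_steps=50):
    """model(noisy_actions, t, obs_emb) -> predicted noise; all names are illustrative."""
    actions = torch.randn(1, horizon, action_dim)        # start from pure noise
    betas = torch.linspace(1e-4, 2e-2, num_steps)        # linear noise schedule (assumed)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    for t in reversed(range(num_steps)):
        eps = model(actions, torch.tensor([t]), obs_emb)  # predict the injected noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (actions - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(actions) if t > 0 else torch.zeros_like(actions)
        actions = mean + torch.sqrt(betas[t]) * noise     # DDPM reverse step
    return actions                                        # denoised action chunk

# Usage with a stand-in noise predictor (a real decoder would be a Transformer):
dummy = lambda a, t, c: torch.zeros_like(a)
chunk = denoise_action_chunk(dummy, obs_emb=None)
print(chunk.shape)  # torch.Size([1, 16, 14])
```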
Abstract
Scaling Transformer policies and diffusion models has advanced robotic manipulation, yet combining these techniques in lightweight, cross-embodiment learning settings remains challenging. We study the design choices that most affect stability and performance for diffusion-transformer policies trained on heterogeneous, multimodal robot data, and introduce Tenma, a lightweight diffusion transformer for bimanual arm control. Tenma integrates multi-view RGB, proprioception, and language via a cross-embodiment normalizer that maps disparate state/action spaces into a shared latent space; a Joint State-Time encoder that learns temporally aligned observation representations while improving inference speed; and a diffusion action decoder optimized for training stability and learning capacity. Across benchmarks and under matched compute, Tenma achieves an average in-distribution success rate of 88.95% and maintains strong performance under object and scene shifts, substantially exceeding baseline policies whose best in-distribution average is 18.12%. Despite training at moderate data scale, Tenma delivers robust manipulation and generalization, indicating the strong potential of multimodal, cross-embodiment learning strategies to further augment the capacity of transformer-based imitation learning policies.
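To make the cross-embodiment normalizer concrete, the sketch below maps embodiments with different state/action dimensionalities into one shared latent space via per-embodiment statistics and linear projections. The class name, dimensions, and the choice of learned linear projections are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of a cross-embodiment normalizer: each embodiment's
# state/action vector is z-scored with per-embodiment statistics and then
# projected into a shared latent space of fixed width.
import torch
import torch.nn as nn

class CrossEmbodimentNormalizer(nn.Module):
    def __init__(self, embodiment_dims: dict[str, int], latent_dim: int = 256):
        super().__init__()
        # One linear projection per embodiment, mapping its native
        # dimensionality into the shared latent space.
        self.proj = nn.ModuleDict(
            {name: nn.Linear(dim, latent_dim) for name, dim in embodiment_dims.items()}
        )
        # Per-embodiment normalization statistics (assumed fixed, computed
        # offline from the training data).
        self.stats = nn.ParameterDict()
        for name, dim in embodiment_dims.items():
            self.stats[name + "_mean"] = nn.Parameter(torch.zeros(dim), requires_grad=False)
            self.stats[name + "_std"] = nn.Parameter(torch.ones(dim), requires_grad=False)

    def forward(self, x: torch.Tensor, embodiment: str) -> torch.Tensor:
        mean = self.stats[embodiment + "_mean"]
        std = self.stats[embodiment + "_std"]
        x = (x - mean) / (std + 1e-6)      # z-score with embodiment-specific stats
        return self.proj[embodiment](x)    # project into the shared latent space

# Usage: two embodiments with different proprioceptive dimensionalities
# end up in the same 256-d latent space.
norm = CrossEmbodimentNormalizer({"bimanual_14dof": 14, "single_arm_7dof": 7})
z_a = norm(torch.randn(4, 14), "bimanual_14dof")
z_b = norm(torch.randn(4, 7), "single_arm_7dof")
print(z_a.shape, z_b.shape)  # torch.Size([4, 256]) torch.Size([4, 256])
```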