Tenma: Robust Cross-Embodiment Robot Manipulation with Diffusion Transformer

📅 2025-09-15
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the challenges of modeling heterogeneous multimodal data and weak policy generalization in lightweight cross-morphology robotic manipulation, this paper proposes a lightweight Diffusion Transformer architecture. The method introduces three key innovations: (1) a cross-morphology normalizer that unifies heterogeneous embodiment representations; (2) a joint state-temporal encoder that fuses multi-view RGB, proprioceptive, and language inputs; and (3) an optimized diffusion-based action decoder enabling high-precision, low-latency bimanual control. Under matched computational budgets, the approach achieves an in-distribution average success rate of 88.95% (a 70.83 percentage-point improvement over baseline methods) while maintaining strong robustness to object and scene variations, significantly enhancing cross-morphology policy transferability and real-time inference efficiency.
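The first component above, the cross-morphology normalizer, can be sketched as a mapping from each robot's own state space into one shared latent space. This is a minimal illustration, not the paper's implementation: the class name, the random linear projection (standing in for a learned embedding), and the per-embodiment statistics are all assumptions.

```python
import numpy as np

class CrossEmbodimentNormalizer:
    """Illustrative sketch: map robots with different state/action
    dimensions into a single shared latent space. The projection
    matrices here are random stand-ins for learned layers."""

    def __init__(self, latent_dim, seed=0):
        self.latent_dim = latent_dim
        self.rng = np.random.default_rng(seed)
        self.proj = {}  # embodiment name -> (W, mean, std)

    def register(self, name, state_dim, mean, std):
        # A random linear projection stands in for a learned embedding layer.
        W = self.rng.standard_normal((self.latent_dim, state_dim)) / np.sqrt(state_dim)
        self.proj[name] = (W, np.asarray(mean, float), np.asarray(std, float))

    def __call__(self, name, state):
        W, mean, std = self.proj[name]
        z = (np.asarray(state, float) - mean) / std  # per-embodiment normalization
        return W @ z                                 # shared latent representation
```

For example, a 7-DoF single arm and a 14-DoF bimanual setup can both be registered and projected into the same 32-dimensional latent space, so one downstream policy can consume either.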

๐Ÿ“ Abstract
Scaling Transformer policies and diffusion models has advanced robotic manipulation, yet combining these techniques in lightweight, cross-embodiment learning settings remains challenging. We study design choices that most affect stability and performance for diffusion-transformer policies trained on heterogeneous, multimodal robot data, and introduce Tenma, a lightweight diffusion-transformer for bimanual arm control. Tenma integrates multiview RGB, proprioception, and language via a cross-embodiment normalizer that maps disparate state/action spaces into a shared latent space; a Joint State-Time encoder for temporally aligned observation learning with inference speed boosts; and a diffusion action decoder optimized for training stability and learning capacity. Across benchmarks and under matched compute, Tenma achieves an average success rate of 88.95% in-distribution and maintains strong performance under object and scene shifts, substantially exceeding baseline policies whose best in-distribution average is 18.12%. Despite using moderate data scale, Tenma delivers robust manipulation and generalization, indicating the strong potential of multimodal and cross-embodiment learning strategies for further augmenting the capacity of transformer-based imitation learning policies.
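The diffusion action decoder the abstract describes generates an action chunk by iteratively denoising Gaussian noise. A minimal DDPM-style reverse process conveys the idea; the schedule, step count, and the `eps_model` noise predictor are illustrative assumptions, not Tenma's actual decoder.

```python
import numpy as np

def ddpm_sample_actions(eps_model, action_dim, horizon, T=50, seed=0):
    """Sketch of a diffusion action decoder: run the DDPM reverse
    process to turn Gaussian noise into an (horizon x action_dim)
    action chunk. eps_model(x, t) stands in for a learned noise
    predictor (e.g. a transformer conditioned on observations)."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, T)       # linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = rng.standard_normal((horizon, action_dim))  # start from pure noise
    for t in range(T - 1, -1, -1):
        eps = eps_model(x, t)                         # predicted noise
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / np.sqrt(alphas[t])     # posterior mean
        if t > 0:                                     # add noise except at t=0
            x += np.sqrt(betas[t]) * rng.standard_normal(x.shape)
    return x
```

In a real policy, `eps_model` would be the conditioned transformer; fewer denoising steps (or a distilled sampler) is what makes low-latency control feasible.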
Problem

Research questions and friction points this paper is trying to address.

Lightweight cross-embodiment learning for robot manipulation
Integrating multimodal data into shared latent space
Improving diffusion-transformer policy stability and performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight diffusion-transformer for bimanual control
Cross-embodiment normalizer for shared latent space
Joint State-Time encoder for temporal alignment
🔎 Similar Papers
No similar papers found.