AI Summary
To address the challenges of modeling heterogeneous multimodal data and weak policy generalization in lightweight cross-embodiment robotic manipulation, this paper proposes Tenma, a lightweight Diffusion Transformer architecture. The method introduces three key innovations: (1) a cross-embodiment normalizer that unifies heterogeneous embodiment representations; (2) a Joint State-Time encoder that fuses multi-view RGB, proprioceptive, and language inputs; and (3) an optimized diffusion-based action decoder enabling high-precision, low-latency bimanual control. Under matched computational budgets, the approach achieves an in-distribution average success rate of 88.95%, a +70.83 percentage-point improvement over baseline methods, while maintaining strong robustness to object and scene variations. This significantly enhances cross-embodiment policy transferability and real-time inference efficiency.
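The diffusion-based action decoder can be pictured as iterative denoising of an action chunk conditioned on the fused observation embedding. The sketch below is a minimal DDPM-style sampling loop, assuming a generic noise-prediction model; the function names, noise schedule, horizon, and action dimensionality are illustrative assumptions, not Tenma's exact decoder.

```python
# Minimal sketch (assumed, not the paper's exact decoder) of diffusion-based
# action decoding: starting from Gaussian noise, an action chunk is iteratively
# denoised by a model conditioned on the fused observation embedding.
import torch

def denoise_action_chunk(model, obs_emb, horizon=16, action_dim=14, num_steps=50):
    """model(noisy_actions, t, obs_emb) -> predicted noise; all names are illustrative."""
    actions = torch.randn(1, horizon, action_dim)        # start from pure noise
    betas = torch.linspace(1e-4, 2e-2, num_steps)        # linear noise schedule (assumed)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    for t in reversed(range(num_steps)):
        eps = model(actions, torch.tensor([t]), obs_emb)  # predict the injected noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (actions - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(actions) if t > 0 else torch.zeros_like(actions)
        actions = mean + torch.sqrt(betas[t]) * noise     # DDPM reverse step
    return actions                                        # denoised action chunk

# Usage with a stand-in noise predictor (a real decoder would be a Transformer):
dummy = lambda a, t, c: torch.zeros_like(a)
chunk = denoise_action_chunk(dummy, obs_emb=None)
print(chunk.shape)  # torch.Size([1, 16, 14])
```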
Abstract
Scaling Transformer policies and diffusion models has advanced robotic manipulation, yet combining these techniques in lightweight, cross-embodiment learning settings remains challenging. We study the design choices that most affect stability and performance for diffusion-transformer policies trained on heterogeneous, multimodal robot data, and introduce Tenma, a lightweight diffusion transformer for bimanual arm control. Tenma integrates multi-view RGB, proprioception, and language via a cross-embodiment normalizer that maps disparate state/action spaces into a shared latent space; a Joint State-Time encoder that learns temporally aligned observation representations while improving inference speed; and a diffusion action decoder optimized for training stability and learning capacity. Across benchmarks and under matched compute, Tenma achieves an average in-distribution success rate of 88.95% and maintains strong performance under object and scene shifts, substantially exceeding baseline policies whose best in-distribution average is 18.12%. Despite training at moderate data scale, Tenma delivers robust manipulation and generalization, indicating the strong potential of multimodal, cross-embodiment learning strategies to further augment the capacity of transformer-based imitation learning policies.
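To make the cross-embodiment normalizer concrete, the sketch below maps embodiments with different state/action dimensionalities into one shared latent space via per-embodiment statistics and linear projections. The class name, dimensions, and the choice of learned linear projections are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of a cross-embodiment normalizer: each embodiment's
# state/action vector is z-scored with per-embodiment statistics and then
# projected into a shared latent space of fixed width.
import torch
import torch.nn as nn

class CrossEmbodimentNormalizer(nn.Module):
    def __init__(self, embodiment_dims: dict[str, int], latent_dim: int = 256):
        super().__init__()
        # One linear projection per embodiment, mapping its native
        # dimensionality into the shared latent space.
        self.proj = nn.ModuleDict(
            {name: nn.Linear(dim, latent_dim) for name, dim in embodiment_dims.items()}
        )
        # Per-embodiment normalization statistics (assumed fixed, computed
        # offline from the training data).
        self.stats = nn.ParameterDict()
        for name, dim in embodiment_dims.items():
            self.stats[name + "_mean"] = nn.Parameter(torch.zeros(dim), requires_grad=False)
            self.stats[name + "_std"] = nn.Parameter(torch.ones(dim), requires_grad=False)

    def forward(self, x: torch.Tensor, embodiment: str) -> torch.Tensor:
        mean = self.stats[embodiment + "_mean"]
        std = self.stats[embodiment + "_std"]
        x = (x - mean) / (std + 1e-6)      # z-score with embodiment-specific stats
        return self.proj[embodiment](x)    # project into the shared latent space

# Usage: two embodiments with different proprioceptive dimensionalities
# end up in the same 256-d latent space.
norm = CrossEmbodimentNormalizer({"bimanual_14dof": 14, "single_arm_7dof": 7})
z_a = norm(torch.randn(4, 14), "bimanual_14dof")
z_b = norm(torch.randn(4, 7), "single_arm_7dof")
print(z_a.shape, z_b.shape)  # torch.Size([4, 256]) torch.Size([4, 256])
```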