🤖 AI Summary
This work addresses the performance degradation that universal policies suffer in cross-embodiment robotic learning due to morphological differences across robots. The authors propose a morphology-aware, Transformer-based policy architecture that explicitly integrates robot morphology through three mechanisms: kinematic tokens derived from joint-wise action decomposition and temporal compression, attention biases guided by kinematic topology, and conditional encodings that incorporate joint semantic attributes. Built on a vision-language-action (VLA) framework, the approach combines self-attention with temporal chunking to exploit these morphological priors. Experiments show that the method significantly outperforms a pi0.5 VLA baseline across diverse robot embodiments while improving generalization and robustness in both single-embodiment and cross-embodiment tasks.
📝 Abstract
Cross-robot policy learning -- training a single policy to perform well across multiple embodiments -- remains a central challenge in robot learning. Transformer-based policies, such as vision-language-action (VLA) models, are typically embodiment-agnostic and must infer kinematic structure purely from observations, which can reduce robustness across embodiments and even limit performance within a single embodiment. We propose an embodiment-aware Transformer policy that injects morphology via three mechanisms: (1) kinematic tokens that factorize actions across joints and compress time through per-joint temporal chunking; (2) a topology-aware attention bias that encodes kinematic topology as an inductive bias in self-attention, encouraging message passing along kinematic edges; and (3) joint-attribute conditioning that augments topology with per-joint descriptors to capture semantics beyond connectivity. Across a range of embodiments, this structured integration consistently improves performance over a vanilla pi0.5 VLA baseline, indicating improved robustness both within an embodiment and across embodiments.
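To make mechanism (2) concrete, here is a minimal sketch of a topology-aware attention bias: given a kinematic tree described by parent indices, we compute all-pairs hop distances between joints and subtract a distance-proportional penalty from the self-attention logits, so that joints connected by kinematic edges attend to each other more strongly. The function names, the parent-array representation, and the linear penalty with slope `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def kinematic_distances(parents):
    """All-pairs hop distances on a kinematic tree given per-joint parent indices (-1 = root)."""
    n = len(parents)
    adj = np.zeros((n, n), dtype=bool)
    for j, p in enumerate(parents):
        if p >= 0:
            adj[j, p] = adj[p, j] = True
    dist = np.full((n, n), np.inf)
    for s in range(n):  # BFS from each joint
        dist[s, s] = 0
        frontier = [s]
        while frontier:
            nxt = []
            for u in frontier:
                for v in np.nonzero(adj[u])[0]:
                    if dist[s, v] == np.inf:
                        dist[s, v] = dist[s, u] + 1
                        nxt.append(v)
            frontier = nxt
    return dist

def topology_biased_attention(q, k, v, parents, alpha=0.5):
    """Single-head self-attention over joint tokens with an additive bias
    of -alpha * hop_distance on the attention logits."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)
    logits = logits - alpha * kinematic_distances(parents)
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v, w
```

With identical queries and keys, the unbiased attention would be uniform; the bias then makes each joint attend most to itself, next to its kinematic neighbors, and least to distant joints, which is the inductive bias the abstract describes.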