🤖 AI Summary
Existing robotic planning methods lack explicit modeling of the dynamic cross-modal coupling between proprioceptive and semantic states, yielding actions that are poorly grounded in the robot's embodiment and semantically inconsistent. This work proposes a cross-modal latent dynamics framework that jointly models their evolution in a shared latent space through an asymmetric cross-attention mechanism. To prevent representational collapse, the approach combines self-supervised objectives, auxiliary reconstruction losses, and an exponential moving average (EMA) target encoder. The predicted embodied future state is fused with observations to condition a diffusion-based policy for action generation. On the LIBERO-LONG benchmark, the method achieves a 94.7% success rate with significantly fewer parameters, matching the performance of large vision-language-action models.
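The summary's central mechanism is an asymmetric cross-attention in which kinematic (proprioceptive) transitions query semantic ones. The paper's actual architecture is not shown here; the following is a minimal numpy sketch of that asymmetry under stated assumptions (single head, no learned projections, token counts and dimensions are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def asymmetric_cross_attention(kin, sem):
    """Asymmetric cross-attention: kinematic tokens act as queries,
    semantic tokens supply both keys and values (so information flows
    semantic -> kinematic but not the reverse).
    kin: (n_kin, d) proprioceptive latents; sem: (n_sem, d) semantic latents.
    Returns a semantic context vector per kinematic token, shape (n_kin, d)."""
    d = kin.shape[-1]
    scores = kin @ sem.T / np.sqrt(d)   # (n_kin, n_sem) scaled dot products
    attn = softmax(scores, axis=-1)     # each kinematic query distributes over semantic keys
    return attn @ sem                   # weighted combination of semantic values

# Illustrative shapes only -- not the paper's actual configuration.
rng = np.random.default_rng(0)
kin = rng.normal(size=(4, 16))   # hypothetical proprioceptive tokens
sem = rng.normal(size=(8, 16))   # hypothetical semantic tokens
out = asymmetric_cross_attention(kin, sem)
print(out.shape)  # (4, 16)
```

In a learned model, `kin` and `sem` would pass through separate query/key/value projections; the asymmetry lies in which modality is allowed to query the other.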
📝 Abstract
Robotic manipulation involves kinematic and semantic transitions that are inherently coupled via underlying actions. However, existing approaches plan in either the semantic or the latent space without explicitly aligning these cross-modal transitions. To address this, we propose CLaD, a framework that models how proprioceptive and semantic states jointly evolve under actions through asymmetric cross-attention that allows kinematic transitions to query semantic ones. CLaD predicts grounded latent foresights via self-supervised objectives with EMA target encoders and auxiliary reconstruction losses, preventing representation collapse while anchoring predictions to observable states. Predicted foresights are modulated with observations to condition a diffusion policy for action generation. On the LIBERO-LONG benchmark, CLaD achieves a 94.7% success rate, competitive with large VLAs while using significantly fewer parameters.
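Both the summary and the abstract credit an EMA target encoder with preventing representation collapse. The idea (a BYOL-style Polyak average, here a sketch rather than the paper's exact recipe; the decay `tau` and parameter names are assumptions) is that the target encoder's weights lag the online encoder's, providing stable prediction targets:

```python
import numpy as np

def ema_update(target, online, tau=0.996):
    """EMA (Polyak) update of target-encoder parameters toward the online
    encoder: target <- tau * target + (1 - tau) * online.
    The target is never updated by gradients, which is what stabilizes
    the self-supervised prediction targets against collapse."""
    return {k: tau * target[k] + (1 - tau) * online[k] for k in target}

# Toy parameters: the target drifts toward the (here frozen) online weights.
online = {"w": np.ones((2, 2))}
target = {"w": np.zeros((2, 2))}
for _ in range(3):
    target = ema_update(target, online, tau=0.5)  # tau exaggerated for illustration
print(target["w"][0, 0])  # 0.875 after three updates (0 -> 0.5 -> 0.75 -> 0.875)
```

In practice `tau` is close to 1 (e.g. 0.99 or higher) so the target moves slowly, and the online encoder is trained to predict the target's outputs for future states.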