🤖 AI Summary
Existing robotic planning methods lack explicit modeling of the dynamic cross-modal coupling between proprioceptive and semantic states, yielding actions that are poorly grounded in the robot's embodiment and semantically inconsistent. This work proposes a cross-modal latent dynamics framework that jointly models their evolution in a shared latent space through an asymmetric cross-attention mechanism. To prevent representational collapse, the approach combines self-supervised objectives, auxiliary reconstruction losses, and an exponential moving average (EMA) target encoder. The predicted embodied future state is fused with observations to condition a diffusion-based policy for action generation. On the LIBERO-LONG benchmark, the method achieves a 94.7% success rate with significantly fewer parameters, matching the performance of large vision-language-action models.
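The summary's central mechanism is an asymmetric cross-attention in which kinematic (proprioceptive) transitions query semantic ones. The paper's actual architecture is not shown here; the following is a minimal numpy sketch of that asymmetry under stated assumptions (single head, no learned projections, token counts and dimensions are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def asymmetric_cross_attention(kin, sem):
    """Asymmetric cross-attention: kinematic tokens act as queries,
    semantic tokens supply both keys and values (so information flows
    semantic -> kinematic but not the reverse).
    kin: (n_kin, d) proprioceptive latents; sem: (n_sem, d) semantic latents.
    Returns a semantic context vector per kinematic token, shape (n_kin, d)."""
    d = kin.shape[-1]
    scores = kin @ sem.T / np.sqrt(d)   # (n_kin, n_sem) scaled dot products
    attn = softmax(scores, axis=-1)     # each kinematic query distributes over semantic keys
    return attn @ sem                   # weighted combination of semantic values

# Illustrative shapes only -- not the paper's actual configuration.
rng = np.random.default_rng(0)
kin = rng.normal(size=(4, 16))   # hypothetical proprioceptive tokens
sem = rng.normal(size=(8, 16))   # hypothetical semantic tokens
out = asymmetric_cross_attention(kin, sem)
print(out.shape)  # (4, 16)
```

In a learned model, `kin` and `sem` would pass through separate query/key/value projections; the asymmetry lies in which modality is allowed to query the other.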
📝 Abstract
Robotic manipulation involves kinematic and semantic transitions that are inherently coupled via underlying actions. However, existing approaches plan in either the semantic or the latent space without explicitly aligning these cross-modal transitions. To address this, we propose CLaD, a framework that models how proprioceptive and semantic states jointly evolve under actions through asymmetric cross-attention that allows kinematic transitions to query semantic ones. CLaD predicts grounded latent foresights via self-supervised objectives with EMA target encoders and auxiliary reconstruction losses, preventing representation collapse while anchoring predictions to observable states. Predicted foresights are modulated with observations to condition a diffusion policy for action generation. On the LIBERO-LONG benchmark, CLaD achieves a 94.7% success rate, competitive with large VLAs while using significantly fewer parameters.
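Both the summary and the abstract credit an EMA target encoder with preventing representation collapse. The idea (a BYOL-style Polyak average, here a sketch rather than the paper's exact recipe; the decay `tau` and parameter names are assumptions) is that the target encoder's weights lag the online encoder's, providing stable prediction targets:

```python
import numpy as np

def ema_update(target, online, tau=0.996):
    """EMA (Polyak) update of target-encoder parameters toward the online
    encoder: target <- tau * target + (1 - tau) * online.
    The target is never updated by gradients, which is what stabilizes
    the self-supervised prediction targets against collapse."""
    return {k: tau * target[k] + (1 - tau) * online[k] for k in target}

# Toy parameters: the target drifts toward the (here frozen) online weights.
online = {"w": np.ones((2, 2))}
target = {"w": np.zeros((2, 2))}
for _ in range(3):
    target = ema_update(target, online, tau=0.5)  # tau exaggerated for illustration
print(target["w"][0, 0])  # 0.875 after three updates (0 -> 0.5 -> 0.75 -> 0.875)
```

In practice `tau` is close to 1 (e.g. 0.99 or higher) so the target moves slowly, and the online encoder is trained to predict the target's outputs for future states.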