🤖 AI Summary
This work addresses the challenge of learning state-space models of dynamical systems directly from unlabeled observational data (e.g., image sequences). We propose a structured continuous-time world model that integrates the Joint Embedding Predictive Architecture (JEPA) with Neural Ordinary Differential Equations (Neural ODEs). Our method explicitly induces a low-dimensional, geometrically well-behaved latent state space by enforcing contraction in the embedding space and Lipschitz regularization on the latent dynamics. Leveraging a contrastive loss and sequential embedding alignment, the model is trained end-to-end solely from raw observations, without access to ground-truth states or dynamics. Experiments on pendulum image sequences demonstrate that the learned latent states are physically interpretable, temporally consistent, and well-ordered; they yield substantially improved long-horizon prediction accuracy and cross-condition generalization. The framework points toward a new approach to data-driven robotic modeling, control, and state estimation: it is differentiable, interpretable, and requires no prior assumptions about the system dynamics.
📝 Abstract
Building on Joint Embedding Predictive Architectures (JEPAs), which appear to be more capable than reconstruction-based methods, this paper introduces a technique for learning world models as continuous-time dynamical systems from arbitrary observation data. The proposed method integrates sequence embeddings with neural ordinary differential equations (neural ODEs), employing loss functions that enforce contractive embeddings and bound the Lipschitz constant of the state transitions, yielding a well-organized latent state space. The approach's effectiveness is demonstrated by learning structured latent state-space models for a simple pendulum system using only image data. This opens a path toward more general control and estimation algorithms with broad applications in robotics.
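The overall recipe described above (embed observations JEPA-style, integrate a latent ODE forward in time, align predicted latents with embeddings of later frames, and regularize the dynamics' Lipschitz constant) can be sketched as follows. This is a minimal illustration, not the paper's implementation: linear maps stand in for the encoder and dynamics networks, Euler steps stand in for the ODE solver, and all names and dimensions are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration only.
OBS_DIM, LATENT_DIM = 16, 2

# Linear "encoder" mapping observations to latent embeddings
# (a stand-in for an image encoder such as a CNN).
W_enc = rng.normal(scale=0.1, size=(LATENT_DIM, OBS_DIM))

# Latent dynamics f(z) = W_dyn @ z, integrated with explicit Euler steps
# (a stand-in for a trained Neural ODE and an adaptive solver).
W_dyn = rng.normal(scale=0.1, size=(LATENT_DIM, LATENT_DIM))


def encode(x):
    """Embed one observation into the latent space."""
    return W_enc @ x


def rollout(z0, n_steps, dt=0.1):
    """Euler-integrate dz/dt = f(z) from z0; returns the latent trajectory."""
    z = z0
    traj = [z]
    for _ in range(n_steps):
        z = z + dt * (W_dyn @ z)
        traj.append(z)
    return np.stack(traj)


def jepa_prediction_loss(obs_seq, dt=0.1):
    """Align ODE-predicted latents with encoder embeddings of later frames."""
    z = encode(obs_seq[0])
    loss = 0.0
    for x_next in obs_seq[1:]:
        z = z + dt * (W_dyn @ z)      # predicted next latent state
        z_target = encode(x_next)     # embedding of the observed next frame
        loss += np.sum((z - z_target) ** 2)
    return loss / (len(obs_seq) - 1)


def lipschitz_penalty(W, bound=1.0):
    """Penalize the dynamics' spectral norm exceeding a target Lipschitz bound."""
    sigma = np.linalg.norm(W, ord=2)
    return max(0.0, sigma - bound) ** 2


# Toy "image sequence": 5 random observation vectors.
obs_seq = rng.normal(size=(5, OBS_DIM))
total_loss = jepa_prediction_loss(obs_seq) + lipschitz_penalty(W_dyn)
```

In a full training loop, `total_loss` would be minimized over the encoder and dynamics parameters by gradient descent, and a contrastive term (as mentioned in the summary) would keep embeddings of distinct states apart.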