🤖 AI Summary
This work addresses the challenge of learning generalizable action representations from visual dynamics to enhance the sample efficiency and generalization of world models in low-data regimes. To this end, the authors propose SCAR, a framework that leverages a pretrained generative backbone and jointly trains inverse and forward dynamics models to encode actions as disentangled latent factors governing controllable visual changes. Disentanglement and cross-embodiment transferability are achieved through Gaussian prior regularization and adversarial invariance constraints. Empirical evaluation demonstrates that SCAR significantly improves both sample efficiency and transfer performance of world models on the Procgen and Robotwin benchmarks, enabling effective generalization across tasks and embodiments.
📝 Abstract
Despite the central role of action in embodied intelligence, learning transferable action representations from visual transitions remains a fundamental challenge, particularly when world models must generalize across embodiments under limited data. We argue that action is not merely an auxiliary conditioning signal, but a distinct representational factor that decouples controllable change from embodiment-specific actuation. In this work, we propose SCAR, a joint inverse-forward dynamics framework for learning unified action representations across embodiments from visual transitions. Built on a pretrained generative backbone, SCAR uses an inverse dynamics model (IDM) to infer latent actions from latent observation pairs and a forward dynamics model (FDM) to predict future dynamics conditioned on them. To make the latent space transferable rather than a generic visual bottleneck, we regularize the latent action posterior toward a standard Gaussian prior to limit arbitrary visual encoding, and introduce adversarial invariance to suppress embodiment- and environment-specific nuisance factors. Experiments on the Procgen and Robotwin datasets show that the learned unified latent action representation serves as a stronger conditioning interface for world modeling than embodiment-specific raw actions, yielding improved cross-embodiment low-data adaptation and cross-task transfer. Taken together, these results suggest that action can be learned as a shared representation of controllable change across embodiments, providing an interface for more transferable and generalizable world models.
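To make the regularization described above concrete, here is a minimal sketch of the closed-form KL divergence between a diagonal Gaussian posterior and a standard Gaussian prior, plus one possible way the three training signals could be combined. The abstract does not specify the posterior parameterization, the loss weights, or the form of the adversarial term, so the diagonal-Gaussian assumption, the function names, and the weights `beta` and `lam` are all hypothetical and used only for illustration.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ) for one latent action.

    Assumes a diagonal Gaussian posterior (an assumption; the abstract only
    says the posterior is regularized toward a standard Gaussian prior).
    """
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def joint_objective(prediction_error, mu, log_var, adversarial_term,
                    beta=1.0, lam=1.0):
    """Hypothetical weighted sum of the three signals named in the abstract.

    prediction_error: FDM error when predicting the future latent observation.
    adversarial_term: scalar penalty from an embodiment/environment
        discriminator (its exact form is not given in the abstract).
    beta, lam: illustrative weights, not values from the paper.
    """
    kl = kl_to_standard_normal(mu, log_var)
    return prediction_error + beta * kl + lam * adversarial_term


# A posterior that already matches the prior incurs no KL penalty.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
```

The KL term grows as the inferred latent action carries more information than the prior allows, which is the sense in which it limits arbitrary visual encoding; the adversarial term is left as an opaque scalar here because its construction is not described in the abstract.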