🤖 AI Summary
Existing video world models can generate photorealistic frames, but their dynamics often fail to respect the structural properties of actions. This work brings group action theory into world modeling, formalizing action-conditioned dynamics as a group action on the latent state space. Structural correctness of action execution is enforced through latent-space regularization based on identity, inverse, and composition consistency, using synthesized supervision that requires no additional data. The work also introduces two structural evaluation metrics: Group-Action Consistency (GAC) and Group-Action Robustness (GAR). Experiments show that the approach consistently improves GAC and GAR when applied to state-of-the-art video world models while preserving perceptual quality, improving rollout stability and the structural fidelity of action dynamics.
📝 Abstract
Video world models have achieved strong visual realism, but this does not ensure that their dynamics are truly governed by actions. In this work, we argue that action faithfulness should be understood through the compositional structure of actions, which in many embodied settings follows a group structure (e.g., SE(2) for navigation). Based on this insight, we formalize action-conditioned world modeling as realizing a group action on the state space, providing a principled criterion for evaluating dynamics beyond visual quality. To operationalize this framework, we propose a unified approach that enforces identity, inverse, and composition consistency via latent-space regularization with synthesized supervision, avoiding additional data collection. We further introduce two metrics, Group-Action Consistency (GAC) and Group-Action Robustness (GAR), to evaluate structural correctness and rollout stability, respectively. Experiments show that our method consistently improves both GAC and GAR in state-of-the-art video world models without degrading perceptual quality.
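To make the three consistency conditions concrete, the following is a minimal sketch on SE(2) using only the Python standard library. It is an illustration under stated assumptions, not the paper's implementation: the state is taken to be a planar pose rather than a learned latent, and the names `exact_step`, `drifting_step`, and `consistency_residuals` are hypothetical. The residuals it computes are simple distances and should not be read as the definitions of GAC or GAR.

```python
import math

# An SE(2) element is (x, y, theta): a planar translation plus a heading angle.
IDENTITY = (0.0, 0.0, 0.0)

def wrap(angle):
    """Wrap an angle into [-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))

def compose(g1, g2):
    """Group product g1 * g2: apply g2 in the frame reached by g1."""
    x1, y1, t1 = g1
    x2, y2, t2 = g2
    c, s = math.cos(t1), math.sin(t1)
    return (x1 + c * x2 - s * y2, y1 + s * x2 + c * y2, wrap(t1 + t2))

def inverse(g):
    """Group inverse: rotation R^T, translation -R^T p."""
    x, y, t = g
    c, s = math.cos(t), math.sin(t)
    return (-(c * x + s * y), s * x - c * y, wrap(-t))

def distance(a, b):
    """Simple pose distance (assumption: unit weight on the angle term)."""
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
                     + wrap(a[2] - b[2]) ** 2)

def consistency_residuals(step, z, g1, g2):
    """Measure how far a transition function `step(z, g)` is from a group action."""
    return {
        # Identity: the null action should leave the state unchanged.
        "identity": distance(step(z, IDENTITY), z),
        # Inverse: an action followed by its inverse should return to z.
        "inverse": distance(step(step(z, g1), inverse(g1)), z),
        # Composition: two steps should match one step with the composed action.
        "composition": distance(step(step(z, g1), g2),
                                step(z, compose(g1, g2))),
    }

def exact_step(z, g):
    """Hypothetical ideal model: an exact (right) group action on poses."""
    return compose(z, g)

DRIFT = (0.01, 0.0, 0.0)  # hypothetical per-step bias

def drifting_step(z, g):
    """Hypothetical flawed model: every step adds a small forward drift."""
    return compose(compose(z, g), DRIFT)

z0 = (1.0, 2.0, 0.5)
forward = (0.3, 0.0, 0.0)
turn = (0.0, 0.0, math.pi / 4)

exact = consistency_residuals(exact_step, z0, forward, turn)
drifting = consistency_residuals(drifting_step, z0, forward, turn)
print("exact:", exact)
print("drifting:", drifting)
```

The exact model yields residuals at floating-point noise level, while the drifting model violates all three conditions even though each individual step looks plausible. In a learned world model, residuals of this kind could be computed in latent space from synthesized action sequences (identity, an action paired with its inverse, or a pair and its composition) and added as regularization terms, which is the general idea the abstract describes.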