🤖 AI Summary
This work addresses a limitation of existing world models, which are often confined to a single viewpoint (typically an egocentric perspective) and thus unable to leverage more informative views, such as a bird's-eye view, for efficient planning. To overcome this, the authors propose Cross-View World Models (XVWM), trained on synchronized multi-view gameplay replays with high-frequency action annotations. Introducing cross-view state prediction as a geometric regularizer encourages XVWM to learn a viewpoint-invariant, three-dimensional representation of the environment. This lets the model predict future states in any target view from an arbitrary input view, supporting parallel imagination across perspectives and planning adapted to task requirements. Experiments demonstrate that XVWM constructs a spatially consistent representation of the embodied environment, substantially enhancing planning flexibility and providing a foundation for viewpoint switching in multi-agent systems.
📝 Abstract
World models enable agents to plan by imagining future states, but existing approaches operate from a single viewpoint, typically egocentric, even when other perspectives would make planning easier; navigation, for instance, benefits from a bird's-eye view. We introduce Cross-View World Models (XVWM), trained with a cross-view prediction objective: given a sequence of frames from one viewpoint, predict the future state from the same or a different viewpoint after an action is taken. Enforcing cross-view consistency acts as geometric regularization: because the input and output views may share little or no visual overlap, the model must learn view-invariant representations of the environment's 3D structure in order to predict across viewpoints. We train on synchronized multi-view gameplay data from Aimlabs, an aim-training platform that provides precisely aligned multi-camera recordings with high-frequency action labels. The resulting model gives agents parallel imagination streams across viewpoints, enabling planning in whichever frame of reference best suits the task while executing from the egocentric view. Our results show that multi-view consistency provides a strong learning signal for spatially grounded representations. Finally, predicting the consequences of one's actions from another viewpoint may offer a foundation for perspective-taking in multi-agent settings.
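To make the cross-view prediction objective concrete, here is a minimal sketch of one training step, assuming a simple latent-dynamics architecture with a shared frame encoder, an action-conditioned recurrent state, and a learned target-view embedding. All module names, dimensions, and the MSE reconstruction loss are illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch of a cross-view prediction step (not the paper's code).
import torch
import torch.nn as nn

class CrossViewWorldModel(nn.Module):
    def __init__(self, latent_dim=256, action_dim=8, num_views=4):
        super().__init__()
        # Shared encoder: maps a frame from any view into a latent that,
        # under the cross-view objective, is pushed toward view invariance.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )
        # Action-conditioned latent dynamics.
        self.dynamics = nn.GRUCell(latent_dim + action_dim, latent_dim)
        # Target-view embedding lets the decoder render the predicted
        # state from a viewpoint that may differ from the input view.
        self.view_embed = nn.Embedding(num_views, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(2 * latent_dim, 64 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
        )

    def forward(self, frames, actions, target_view):
        # frames:  (B, T, 3, H, W) clip from the source view
        # actions: (B, T, action_dim); target_view: (B,) view indices
        B, T = frames.shape[:2]
        h = torch.zeros(B, self.dynamics.hidden_size, device=frames.device)
        for t in range(T):
            z = self.encoder(frames[:, t])
            h = self.dynamics(torch.cat([z, actions[:, t]], dim=-1), h)
        # Decode the predicted next state in the requested target view.
        v = self.view_embed(target_view)
        return self.decoder(torch.cat([h, v], dim=-1))

# One training step: input from one view, supervised by the synchronized
# next frame from a (possibly different) view. Data here is random noise
# standing in for aligned multi-camera recordings.
model = CrossViewWorldModel()
frames = torch.randn(2, 4, 3, 32, 32)    # source-view frame sequence
actions = torch.randn(2, 4, 8)           # high-frequency action labels
target_view = torch.randint(0, 4, (2,))  # e.g. index of a bird's-eye camera
next_frame = torch.randn(2, 3, 32, 32)   # ground-truth frame in target view
pred = model(frames, actions, target_view)
loss = nn.functional.mse_loss(pred, next_frame)
loss.backward()
```

The key design point this sketch tries to capture is that the encoder and dynamics are shared across all viewpoints, with only a small view embedding telling the decoder which camera to render; because the supervision target can come from a view with no visual overlap with the input, the shared latent is forced to carry view-invariant 3D structure rather than view-specific appearance.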