🤖 AI Summary
This work addresses the challenge of modeling a single shared world for multiple interacting agents in video generation. The authors propose a multi-agent video generation framework that combines multi-view videos with cross-agent spatiotemporal interaction to keep the shared world coherent. The key contributions are a large-scale multi-agent interaction video dataset built on the CARLA simulation platform, a four-view spatial concatenation strategy that ensures geometric consistency among each agent's views, and a cross-agent attention mechanism that enforces consistency in overlapping regions while generating plausible content in non-overlapping regions. Built on a pretrained large video model, the method generates 49-frame videos, accurately perceives the positions of dynamic agents, and achieves consistent shared-world modeling.
📝 Abstract
This paper presents ShareVerse, a video generation framework for multi-agent shared world modeling. It addresses a gap in existing work, which lacks support for constructing a unified shared world under multi-agent interaction. ShareVerse leverages the generation capability of large video models and integrates three key innovations: 1) We build a dataset for large-scale multi-agent interactive world modeling on the CARLA simulation platform, featuring diverse scenes, weather conditions, and interactive trajectories, with paired multi-view videos (front, rear, left, and right views per agent) and camera data. 2) We propose a spatial concatenation strategy for the four-view videos of each independent agent, which models a broader environment and ensures internal multi-view geometric consistency. 3) We integrate cross-agent attention blocks into the pretrained video model; these blocks transmit spatial-temporal information across agents, ensuring shared world consistency in overlapping regions and reasonable generation in non-overlapping regions. ShareVerse supports 49-frame large-scale video generation, accurately perceives the positions of dynamic agents, and achieves consistent shared world modeling.
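To make the second and third components concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a 2x2 grid layout for the four views and a single-head, residual attention step in which one agent's tokens query another agent's tokens. The abstract specifies neither the layout nor the attention design, so the function names (`stitch_views`, `cross_agent_attention`), the projection matrices, and all shapes are hypothetical.

```python
import numpy as np


def stitch_views(views, layout=("front", "rear", "left", "right")):
    """Concatenate four clips of shape (T, H, W, C) into one (T, 2H, 2W, C) clip.

    Assumption: a 2x2 grid (first two views on top, last two below).
    The layout actually used by ShareVerse is not given in the abstract.
    """
    top = np.concatenate([views[layout[0]], views[layout[1]]], axis=2)
    bottom = np.concatenate([views[layout[2]], views[layout[3]]], axis=2)
    return np.concatenate([top, bottom], axis=1)


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_agent_attention(tokens_a, tokens_b, w_q, w_k, w_v):
    """Let agent A's tokens (N_a, D) attend to agent B's tokens (N_b, D).

    Assumption: single-head scaled dot-product attention with a residual
    connection, so agent A keeps its own features and adds what it reads
    from agent B. Returns an array of shape (N_a, D).
    """
    q = tokens_a @ w_q                      # queries from agent A
    k = tokens_b @ w_k                      # keys from agent B
    v = tokens_b @ w_v                      # values from agent B
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)      # each row sums to 1
    return tokens_a + weights @ v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames, height, width, channels = 49, 8, 12, 3
    names = ("front", "rear", "left", "right")
    views = {n: rng.random((frames, height, width, channels)) for n in names}
    stitched = stitch_views(views)
    print(stitched.shape)  # (49, 16, 24, 3)

    dim = 16
    tokens_a = rng.standard_normal((5, dim))
    tokens_b = rng.standard_normal((7, dim))
    w_q, w_k, w_v = (rng.standard_normal((dim, dim)) * 0.1 for _ in range(3))
    fused = cross_agent_attention(tokens_a, tokens_b, w_q, w_k, w_v)
    print(fused.shape)  # (5, 16)
```

In this sketch, stitching keeps all four views of one agent in a single frame so that a video model can treat them jointly, while the residual form of the attention step means an agent's features are unchanged whenever the other agent contributes nothing. Both are illustrative design choices rather than details confirmed by the paper.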