🤖 AI Summary
This work addresses object identity drift in videos generated by driving world models, a problem that arises from the lack of instance-level temporal constraints. To mitigate it, the authors propose an instance-mask-based modeling approach that explicitly enforces identity and temporal consistency using two kinds of masks: instance identity masks and trajectory masks. These masks are integrated into both the attention computation, through an instance-masked attention mechanism, and the optimization objective, through an adaptive foreground-weighted loss guided by probabilistic masking. The authors describe the method as the first to achieve fine-grained, instance-level identity preservation in driving world models. It significantly improves generated video quality on the nuScenes dataset and also improves performance on downstream autonomous driving tasks.
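The foreground-weighted loss can be illustrated with a minimal sketch. The summary does not give the actual formulation, so everything here is an assumption made for illustration: the function name `instance_masked_loss`, the parameters `fg_weight` and `mask_prob`, the use of squared error, and the reading of "probabilistic" as "apply the foreground weighting with some probability per training step".

```python
import random


def instance_masked_loss(pred, target, fg_mask, fg_weight=2.0, mask_prob=0.5, rng=None):
    """Weighted mean squared error over flat lists of pixel values.

    Hypothetical formulation (not taken from the paper): with probability
    `mask_prob`, pixels covered by an instance mask are weighted by
    `fg_weight`; otherwise every pixel has weight 1, so the background
    still contributes to training.
    """
    rng = rng or random.Random()
    # Decide once per call whether the instance mask is applied.
    use_mask = rng.random() < mask_prob
    weights = [fg_weight if (use_mask and m) else 1.0 for m in fg_mask]
    sq_err = [(p - t) ** 2 for p, t in zip(pred, target)]
    # Normalize by the total weight so the loss scale stays comparable.
    return sum(w * e for w, e in zip(weights, sq_err)) / sum(weights)


# One foreground pixel with an error of 1, three error-free background pixels.
pred, target, fg_mask = [1.0, 0.0, 0.0, 0.0], [0.0] * 4, [1, 0, 0, 0]
masked = instance_masked_loss(pred, target, fg_mask, fg_weight=3.0, mask_prob=1.0)
unmasked = instance_masked_loss(pred, target, fg_mask, fg_weight=3.0, mask_prob=0.0)
```

In this toy case the foreground error counts for more when the mask is applied (`masked` is larger than `unmasked`), which is the intended effect of emphasizing instance regions without discarding the background term.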
📝 Abstract
Autonomous driving relies on robust models trained on large-scale, high-quality multi-view driving videos. Although world models provide a cost-effective solution for generating realistic driving data, they often suffer from identity drift, where the same object changes its appearance or category across frames due to the absence of instance-level temporal constraints. We introduce ConsisDrive, an identity-preserving driving world model designed to enforce temporal consistency at the instance level. Our framework incorporates two key components: (1) Instance-Masked Attention, which applies instance identity masks and trajectory masks within attention blocks to ensure that visual tokens interact only with their corresponding instance features across spatial and temporal dimensions, thereby preserving object identity consistency; and (2) Instance-Masked Loss, which adaptively emphasizes foreground regions with probabilistic instance masking, reducing background noise while maintaining overall scene fidelity. By integrating these mechanisms, ConsisDrive achieves state-of-the-art driving video generation quality and demonstrates significant improvements in downstream autonomous driving tasks on the nuScenes dataset. Our project page is https://shanpoyang654.github.io/ConsisDrive/page.html.
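The idea behind Instance-Masked Attention can be sketched as attention restricted by instance identity. This is a simplified illustration under stated assumptions, not the ConsisDrive implementation: the helper names `build_instance_mask` and `masked_attention` are hypothetical, each token carries a single integer instance id, the id -1 stands for background, and the trajectory mask is approximated by keeping the same id for an object in every frame.

```python
import math


def build_instance_mask(visual_ids, cond_ids):
    """mask[i][j] is True when visual token i may attend to condition token j.

    Assumption: a pair is allowed only if both tokens carry the same
    instance id. Reusing an id across frames lets a token reach its own
    instance features over time.
    """
    return [[v == c for c in cond_ids] for v in visual_ids]


def masked_attention(queries, keys, values, mask):
    """Scaled dot-product attention where disallowed pairs get zero weight."""
    dim = len(queries[0])
    out_dim = len(values[0])
    outputs = []
    for q, row in zip(queries, mask):
        if not any(row):
            # No matching instance (e.g. background): contribute nothing.
            outputs.append([0.0] * out_dim)
            continue
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) if ok else -math.inf
            for k, ok in zip(keys, row)
        ]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # masked entries become 0.0
        total = sum(exps)
        outputs.append(
            [sum(e / total * v[d] for e, v in zip(exps, values)) for d in range(out_dim)]
        )
    return outputs


# Three visual tokens: instance 0, instance 1, and background (-1).
visual_ids, cond_ids = [0, 1, -1], [0, 1]
queries = [[0.3, 0.7], [0.9, 0.1], [0.5, 0.5]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
mask = build_instance_mask(visual_ids, cond_ids)
attended = masked_attention(queries, keys, values, mask)
```

Because each visual token here matches exactly one condition token, it receives only that instance's value vector regardless of the query-key scores, which is how the mask prevents features of one object from leaking into another.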