🤖 AI Summary
This work tackles the identity drift and geometric distortion that arise in image-to-video generation from single-view inputs. The proposed ConsID-Gen framework integrates semantic and structural information from pose-free auxiliary views to achieve cross-view identity consistency and temporal coherence. Its contributions include ConsIDVid, a high-quality multi-view aligned dataset built with a scalable construction pipeline; a dual-stream visual-geometric encoder paired with a text-visual connector; and ConsIDVid-Bench, the first evaluation benchmark for multi-view consistent image-to-video generation. Built on a diffusion Transformer backbone and enhanced with multi-view augmentation and cross-modal alignment, the model outperforms state-of-the-art methods such as Wan2.1 and HunyuanVideo on ConsIDVid-Bench, achieving superior identity fidelity and temporal consistency in complex real-world scenarios.
📝 Abstract
Image-to-video (I2V) generation animates a static image into a temporally coherent video sequence following textual instructions, yet preserving fine-grained object identity under changing viewpoints remains a persistent challenge. Unlike text-to-video models, existing I2V pipelines often suffer from appearance drift and geometric distortion, artifacts we attribute to the sparsity of single-view 2D observations and weak cross-modal alignment. We address this problem from both the data and model perspectives. First, we curate ConsIDVid, a large-scale object-centric dataset built with a scalable pipeline for high-quality, temporally aligned videos, and establish ConsIDVid-Bench, a benchmark that evaluates multi-view consistency with metrics sensitive to subtle geometric and appearance deviations. We further propose ConsID-Gen, a view-assisted I2V generation framework that augments the first frame with unposed auxiliary views and fuses semantic and structural cues through a dual-stream visual-geometric encoder and a text-visual connector, yielding unified conditioning for a Diffusion Transformer backbone. Experiments on ConsIDVid-Bench show that ConsID-Gen outperforms leading video generation models such as Wan2.1 and HunyuanVideo across multiple metrics, delivering superior identity fidelity and temporal coherence in challenging real-world scenarios. We will release our model and dataset at https://myangwu.github.io/ConsID-Gen.