🤖 AI Summary
Current text-to-video models lack instance-level fine-grained control and global semantic consistency. To address this, we propose an instance-aware masked cross-attention mechanism and a Shared Timestep-Adaptive Prompt Enhancement module, enabling parameter-efficient, spatially precise instance-level video generation. We further introduce spatially-aware unconditional guidance to improve spatiotemporal coherence and construct InstanceBench, a comprehensive evaluation benchmark tailored to instance controllability. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches in instance localization accuracy, small-object retention, and overall video quality. Notably, it achieves superior performance with minimal parameter overhead, establishing a new paradigm for controllable video generation grounded in precise instance awareness and holistic semantic alignment.
📝 Abstract
Recent advances in text-to-video diffusion models have enabled the generation of high-quality videos conditioned on textual descriptions. However, most existing text-to-video models rely solely on textual conditions and lack general fine-grained controllability over video generation. To address this challenge, we propose InstanceV, a video generation framework that enables i) instance-level control and ii) global semantic consistency. Specifically, with the aid of the proposed Instance-aware Masked Cross-Attention mechanism, InstanceV fully exploits additional instance-level grounding information to generate correctly attributed instances at designated spatial locations. To improve overall consistency, we introduce the Shared Timestep-Adaptive Prompt Enhancement module, which connects local instances with global semantics in a parameter-efficient manner. Furthermore, we incorporate Spatially-Aware Unconditional Guidance during both training and inference to alleviate the disappearance of small instances. Finally, we propose a new benchmark, named InstanceBench, which combines general video quality metrics with instance-aware metrics for a more comprehensive evaluation of instance-level video generation. Extensive experiments demonstrate that InstanceV not only achieves remarkable instance-level controllability in video generation, but also outperforms existing state-of-the-art models on both general quality and instance-aware metrics across qualitative and quantitative evaluations.
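The abstract does not spell out the mechanism, but the core idea of an instance-aware masked cross-attention can be illustrated with a minimal sketch: each instance carries its own text tokens, and a binary spatial mask restricts that instance's cross-attention output to the video tokens it occupies. All names and shapes below are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def masked_cross_attention(video_q, inst_keys, inst_values, inst_masks):
    """Illustrative instance-aware masked cross-attention (assumed shapes).

    video_q:     (N, d)    flattened spatiotemporal query tokens
    inst_keys:   (I, L, d) per-instance text keys (I instances, L tokens each)
    inst_values: (I, L, d) per-instance text values
    inst_masks:  (I, N)    binary masks; 1 where instance i covers token n
    Returns (N, d): instance features written only inside each mask.
    """
    N, d = video_q.shape
    out = torch.zeros_like(video_q)
    for k, v, m in zip(inst_keys, inst_values, inst_masks):
        attn = (video_q @ k.T) / d ** 0.5          # (N, L) similarity to this
        attn = attn.softmax(dim=-1)                # instance's text tokens
        out = out + m.unsqueeze(-1) * (attn @ v)   # zero outside the region
    return out

# Toy usage: 16 video tokens, 2 instances with disjoint spatial regions.
q = torch.randn(16, 8)
keys = torch.randn(2, 4, 8)
vals = torch.randn(2, 4, 8)
masks = torch.zeros(2, 16)
masks[0, :8] = 1       # instance 0 occupies tokens 0..7
masks[1, 8:12] = 1     # instance 1 occupies tokens 8..11; 12..15 stay empty
feat = masked_cross_attention(q, keys, vals, masks)
print(feat.shape)  # torch.Size([16, 8])
```

The masking confines each instance's textual attributes to its designated location, which is the property the abstract attributes to the mechanism; tokens covered by no instance receive zero contribution from this branch.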