🤖 AI Summary
Multi-reference image-driven video generation must simultaneously maintain multi-subject consistency and high generation quality. This paper introduces a unified video generation framework that jointly conditions on an arbitrary number of reference images, spanning diverse categories (e.g., humans, objects, backgrounds), together with a textual prompt. The method builds on diffusion models and integrates multi-reference encoding, text alignment, and mask-guided synthesis. Key contributions include (1) a region-aware dynamic masking mechanism that enables subject-adaptive conditional injection, and (2) a pixel-wise channel concatenation strategy that lets a single model generalize, without architectural changes, from single-subject training to multi-subject inference and fine-grained controllable synthesis. Evaluated on a newly constructed multi-subject video benchmark, the approach achieves state-of-the-art performance, significantly outperforming both open-source and commercial baselines, while delivering high fidelity, strong controllability, and efficient training and deployment.
📝 Abstract
Video generation has made substantial strides with the emergence of deep generative models, especially diffusion-based approaches. However, video generation conditioned on multiple reference subjects still faces significant challenges in maintaining multi-subject consistency and ensuring high generation quality. In this paper, we propose MAGREF, a unified framework for any-reference video generation that introduces masked guidance to enable coherent multi-subject video synthesis conditioned on diverse reference images and a textual prompt. Specifically, we propose (1) a region-aware dynamic masking mechanism that enables a single model to flexibly handle inference over diverse subjects, including humans, objects, and backgrounds, without architectural changes, and (2) a pixel-wise channel concatenation mechanism that operates along the channel dimension to better preserve appearance features. Our model delivers state-of-the-art video generation quality, generalizing from single-subject training to complex multi-subject scenarios with coherent synthesis and precise control over individual subjects, and it outperforms existing open-source and commercial baselines. To facilitate evaluation, we also introduce a comprehensive multi-subject video benchmark. Extensive experiments demonstrate the effectiveness of our approach, paving the way for scalable, controllable, and high-fidelity multi-subject video synthesis. Code and model can be found at: https://github.com/MAGREF-Video/MAGREF
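To make the conditioning idea concrete, the sketch below illustrates pixel-wise channel concatenation with a region mask. All shapes, variable names, and the use of NumPy are illustrative assumptions, not the paper's actual implementation: the point is only that reference features and a binary mask are appended along the channel axis, so the diffusion backbone needs just a wider input projection rather than architectural changes.

```python
import numpy as np

# Hypothetical latent-video shape: C channels, T frames, H x W spatial grid.
C, T, H, W = 4, 8, 32, 32
noisy_latent = np.random.randn(C, T, H, W).astype(np.float32)

# Two reference subjects encoded into the same latent space and placed in
# disjoint spatial regions (a simplified stand-in for the paper's
# region-aware dynamic masking).
ref_latent = np.zeros((C, T, H, W), dtype=np.float32)
mask = np.zeros((1, T, H, W), dtype=np.float32)

subject_a = np.random.randn(C, H, W // 2).astype(np.float32)
subject_b = np.random.randn(C, H, W // 2).astype(np.float32)
ref_latent[:, 0, :, : W // 2] = subject_a  # subject A region, first frame
ref_latent[:, 0, :, W // 2 :] = subject_b  # subject B region, first frame
mask[:, 0] = 1.0                           # mark the conditioned pixels

# Pixel-wise channel concatenation: conditioning signals are stacked along
# the channel axis, giving the backbone a (2*C + 1)-channel input.
model_input = np.concatenate([noisy_latent, ref_latent, mask], axis=0)
print(model_input.shape)  # (9, 8, 32, 32)
```

Because the number and layout of subjects is expressed entirely through the mask and reference channels, the same input interface serves one subject or many, which is the property the abstract describes as generalizing from single-subject training to multi-subject inference.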