🤖 AI Summary
Multi-reference image-driven video generation must simultaneously maintain multi-subject consistency and high generation quality. This paper introduces a unified video generation framework that jointly conditions on an arbitrary number of reference images, spanning diverse categories (e.g., humans, objects, backgrounds), together with a textual prompt. The method builds on diffusion models and integrates multi-reference encoding, text alignment, and mask-guided synthesis. Key contributions include (1) a region-aware dynamic masking mechanism that enables subject-adaptive conditional injection, and (2) a pixel-wise channel concatenation strategy that lets a single model generalize, without architectural changes, from single-subject training to multi-subject inference and fine-grained controllable synthesis. Evaluated on a newly constructed multi-subject video benchmark, the approach achieves state-of-the-art performance, significantly outperforming both open-source and commercial baselines, while delivering high fidelity, strong controllability, and efficient training and deployment.
📝 Abstract
Video generation has made substantial strides with the emergence of deep generative models, especially diffusion-based approaches. However, video generation conditioned on multiple reference subjects still faces significant challenges in maintaining multi-subject consistency and ensuring high generation quality. In this paper, we propose MAGREF, a unified framework for any-reference video generation that introduces masked guidance to enable coherent multi-subject video synthesis conditioned on diverse reference images and a textual prompt. Specifically, we propose (1) a region-aware dynamic masking mechanism that enables a single model to flexibly handle inference over diverse subjects, including humans, objects, and backgrounds, without architectural changes, and (2) a pixel-wise channel concatenation mechanism that operates along the channel dimension to better preserve appearance features. Our model delivers state-of-the-art video generation quality, generalizing from single-subject training to complex multi-subject scenarios with coherent synthesis and precise control over individual subjects, and it outperforms existing open-source and commercial baselines. To facilitate evaluation, we also introduce a comprehensive multi-subject video benchmark. Extensive experiments demonstrate the effectiveness of our approach, paving the way for scalable, controllable, and high-fidelity multi-subject video synthesis. Code and model can be found at: https://github.com/MAGREF-Video/MAGREF
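To make the conditioning idea concrete, the sketch below illustrates pixel-wise channel concatenation with a region mask. All shapes, variable names, and the use of NumPy are illustrative assumptions, not the paper's actual implementation: the point is only that reference features and a binary mask are appended along the channel axis, so the diffusion backbone needs just a wider input projection rather than architectural changes.

```python
import numpy as np

# Hypothetical latent-video shape: C channels, T frames, H x W spatial grid.
C, T, H, W = 4, 8, 32, 32
noisy_latent = np.random.randn(C, T, H, W).astype(np.float32)

# Two reference subjects encoded into the same latent space and placed in
# disjoint spatial regions (a simplified stand-in for the paper's
# region-aware dynamic masking).
ref_latent = np.zeros((C, T, H, W), dtype=np.float32)
mask = np.zeros((1, T, H, W), dtype=np.float32)

subject_a = np.random.randn(C, H, W // 2).astype(np.float32)
subject_b = np.random.randn(C, H, W // 2).astype(np.float32)
ref_latent[:, 0, :, : W // 2] = subject_a  # subject A region, first frame
ref_latent[:, 0, :, W // 2 :] = subject_b  # subject B region, first frame
mask[:, 0] = 1.0                           # mark the conditioned pixels

# Pixel-wise channel concatenation: conditioning signals are stacked along
# the channel axis, giving the backbone a (2*C + 1)-channel input.
model_input = np.concatenate([noisy_latent, ref_latent, mask], axis=0)
print(model_input.shape)  # (9, 8, 32, 32)
```

Because the number and layout of subjects is expressed entirely through the mask and reference channels, the same input interface serves one subject or many, which is the property the abstract describes as generalizing from single-subject training to multi-subject inference.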