🤖 AI Summary
Multi-subject video generation faces two key challenges: scale inconsistency (abrupt changes in subject size) and permutation sensitivity (output quality degrades when the order of reference images varies). To address these, we propose MoFu, the first text-driven, multi-reference video generation framework robust to both scale variations and input permutations. Our approach comprises three core components: (1) an LLM-guided Scale-aware Modulation Operator (SMO) for semantically aligned, adaptive feature scaling; (2) an FFT-based frequency-domain feature fusion mechanism that decouples spatial structure from global layout; and (3) a novel scale-permutation stability loss that jointly optimizes temporal consistency and permutation invariance. Evaluated on our newly constructed Scale-Perm benchmark, our method significantly improves subject naturalness, visual fidelity, and motion coherence, consistently outperforming existing state-of-the-art methods across all metrics.
📝 Abstract
Multi-subject video generation aims to synthesize videos from textual prompts and multiple reference images, ensuring that each subject preserves natural scale and visual fidelity. However, current methods face two challenges: scale inconsistency, where variations in subject size lead to unnatural generation, and permutation sensitivity, where the order of reference inputs causes subject distortion. In this paper, we propose MoFu, a unified framework that tackles both challenges. For scale inconsistency, we introduce Scale-Aware Modulation (SMO), an LLM-guided module that extracts implicit scale cues from the prompt and modulates features to ensure consistent subject sizes. To address permutation sensitivity, we present a simple yet effective Fourier Fusion strategy that processes the frequency information of reference features via the Fast Fourier Transform to produce a unified representation. In addition, we design a Scale-Permutation Stability Loss that jointly encourages scale-consistent and permutation-invariant generation. To evaluate these challenges directly, we establish Scale-Perm, a dedicated benchmark with controlled variations in subject scale and reference permutation. Extensive experiments demonstrate that MoFu significantly outperforms existing methods in preserving natural scale, subject fidelity, and overall visual quality.
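The abstract does not give implementation details of Fourier Fusion, but its permutation-invariance property can be illustrated with a minimal sketch: if the FFT spectra of the reference features are combined with a symmetric reduction such as the mean, the fused representation is independent of the order in which references are supplied. All function and variable names below are hypothetical stand-ins, not the paper's actual code.

```python
import numpy as np

def fourier_fuse(features):
    """Fuse a list of (H, W, C) reference feature maps in the frequency domain.

    Hypothetical sketch: FFT each map over its spatial axes, average the
    spectra (a symmetric, order-invariant reduction), then invert the FFT.
    """
    spectra = [np.fft.fft2(f, axes=(0, 1)) for f in features]
    fused_spectrum = np.mean(spectra, axis=0)  # mean ignores input order
    return np.real(np.fft.ifft2(fused_spectrum, axes=(0, 1)))

# Shuffling the reference order leaves the fused features unchanged.
rng = np.random.default_rng(0)
feats = [rng.normal(size=(8, 8, 4)) for _ in range(3)]
assert np.allclose(fourier_fuse(feats), fourier_fuse(feats[::-1]))
```

Since both the FFT and the mean are linear, this particular reduction is equivalent to averaging in the spatial domain; the paper's actual fusion presumably manipulates the spectra non-trivially (for instance, treating amplitude and phase differently), which this sketch does not attempt to reproduce.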