🤖 AI Summary
Multi-subject video generation faces two key challenges: scale inconsistency (abrupt changes in subject size) and permutation sensitivity (output quality degrades when the order of reference images varies). To address these, we propose MoFu, the first text-driven, multi-reference video generation framework robust to both scale variations and input permutations. Our approach comprises three core components: (1) an LLM-guided Scale-aware Modulation Operator (SMO) for semantically aligned, adaptive feature scaling; (2) an FFT-based frequency-domain feature fusion mechanism that decouples spatial structure from global layout; and (3) a novel scale-permutation stability loss that jointly optimizes temporal consistency and permutation invariance. Evaluated on our newly constructed Scale-Perm benchmark, our method significantly improves subject naturalness, visual fidelity, and motion coherence, consistently outperforming existing state-of-the-art methods across all metrics.
📝 Abstract
Multi-subject video generation aims to synthesize videos from textual prompts and multiple reference images, ensuring that each subject preserves natural scale and visual fidelity. However, current methods face two challenges: scale inconsistency, where variations in subject size lead to unnatural generation, and permutation sensitivity, where the order of reference inputs causes subject distortion. In this paper, we propose MoFu, a unified framework that tackles both challenges. For scale inconsistency, we introduce Scale-Aware Modulation (SMO), an LLM-guided module that extracts implicit scale cues from the prompt and modulates features to ensure consistent subject sizes. To address permutation sensitivity, we present a simple yet effective Fourier Fusion strategy that processes the frequency information of reference features via the Fast Fourier Transform to produce a unified representation. In addition, we design a Scale-Permutation Stability Loss that jointly encourages scale-consistent and permutation-invariant generation. To evaluate these challenges directly, we establish Scale-Perm, a dedicated benchmark with controlled variations in subject scale and reference permutation. Extensive experiments demonstrate that MoFu significantly outperforms existing methods in preserving natural scale, subject fidelity, and overall visual quality.
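The abstract does not give implementation details of Fourier Fusion, but its permutation-invariance property can be illustrated with a minimal sketch: if the FFT spectra of the reference features are combined with a symmetric reduction such as the mean, the fused representation is independent of the order in which references are supplied. All function and variable names below are hypothetical stand-ins, not the paper's actual code.

```python
import numpy as np

def fourier_fuse(features):
    """Fuse a list of (H, W, C) reference feature maps in the frequency domain.

    Hypothetical sketch: FFT each map over its spatial axes, average the
    spectra (a symmetric, order-invariant reduction), then invert the FFT.
    """
    spectra = [np.fft.fft2(f, axes=(0, 1)) for f in features]
    fused_spectrum = np.mean(spectra, axis=0)  # mean ignores input order
    return np.real(np.fft.ifft2(fused_spectrum, axes=(0, 1)))

# Shuffling the reference order leaves the fused features unchanged.
rng = np.random.default_rng(0)
feats = [rng.normal(size=(8, 8, 4)) for _ in range(3)]
assert np.allclose(fourier_fuse(feats), fourier_fuse(feats[::-1]))
```

Since both the FFT and the mean are linear, this particular reduction is equivalent to averaging in the spatial domain; the paper's actual fusion presumably manipulates the spectra non-trivially (for instance, treating amplitude and phase differently), which this sketch does not attempt to reproduce.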