🤖 AI Summary
Existing methods predominantly rely on a single-subject assumption, injecting text/image/audio conditions globally and thereby failing to support fine-grained spatiotemporal control in multi-person or human-object interaction scenarios. This work proposes a region-level multimodal condition-binding framework built on three components: (1) identity-aware layout inference via appearance-guided mask prediction; (2) learnable region-specific audio embeddings coupled with iterative layout alignment to ensure cross-modal spatiotemporal consistency; and (3) a diffusion-based video generation architecture. The approach overcomes the single-subject limitation, enabling precise per-region conditioning, and significantly improves controllability and visual fidelity in multi-character interaction, human-object collaboration, and speech-driven motion synchronization tasks. Quantitative and qualitative evaluations demonstrate consistent superiority over state-of-the-art methods across multiple metrics.
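To make component (1) more concrete, below is a minimal illustrative sketch of how appearance-guided mask prediction could be set up: features of the partially denoised video are matched against per-identity reference appearance embeddings through a small attention-style module, producing a soft spatiotemporal layout mask per concept. The module name, dimensions, and overall design are assumptions for illustration only, not the authors' implementation.

```python
# Hypothetical sketch of appearance-guided mask prediction.
# AppearanceGuidedMaskPredictor, d_model, and all shapes below are
# illustrative assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn


class AppearanceGuidedMaskPredictor(nn.Module):
    """Predicts a soft spatiotemporal mask for each reference identity by
    matching video features against that identity's appearance embedding."""

    def __init__(self, d_model: int = 256):
        super().__init__()
        self.to_q = nn.Linear(d_model, d_model)  # queries from video features
        self.to_k = nn.Linear(d_model, d_model)  # keys from reference appearances
        self.scale = d_model ** -0.5

    def forward(self, video_feats: torch.Tensor, ref_embeds: torch.Tensor) -> torch.Tensor:
        # video_feats: (B, T*H*W, d) features of the partially denoised video
        # ref_embeds:  (B, N, d)     one appearance embedding per reference concept
        q = self.to_q(video_feats)                      # (B, THW, d)
        k = self.to_k(ref_embeds)                       # (B, N, d)
        sim = torch.einsum("bld,bnd->bln", q, k) * self.scale
        # Softmax over identities: each location is softly assigned to a concept.
        return sim.softmax(dim=-1)                      # (B, THW, N) soft layout masks


# Usage: 2 reference identities, a 4-frame 16x16 latent grid, 256-dim features.
predictor = AppearanceGuidedMaskPredictor(d_model=256)
masks = predictor(torch.randn(1, 4 * 16 * 16, 256), torch.randn(1, 2, 256))
print(masks.shape)  # torch.Size([1, 1024, 2])
```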
📝 Abstract
End-to-end human animation with rich multi-modal conditions (e.g., text, image, and audio) has achieved remarkable advancements in recent years. However, most existing methods can only animate a single subject and inject conditions in a global manner, ignoring scenarios in which multiple concepts appear in the same video with rich human-human and human-object interactions. This global assumption prevents precise, per-identity control of multiple concepts, including humans and objects, and therefore hinders practical applications. In this work, we discard the single-entity assumption and introduce a novel framework that enforces strong, region-specific binding of conditions from all modalities to each identity's spatiotemporal footprint. Given reference images of multiple concepts, our method can automatically infer layout information by leveraging a mask predictor to match appearance cues between the denoised video and each reference appearance. Furthermore, we inject each local audio condition into its corresponding region in an iterative manner to ensure layout-aligned modality matching. This design enables high-quality generation of controllable multi-concept human-centric videos. Empirical results and ablation studies validate the effectiveness of our explicit layout control for multi-modal conditions compared to implicit counterparts and other existing methods.
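To illustrate the region-specific audio binding described in the abstract, the following sketch shows one way per-identity audio embeddings could be gated by the predicted layout masks so that each audio stream only influences its own region. The function name `inject_local_audio` and the tensor shapes are assumptions for illustration; the actual method may instead use cross-attention and iterative layout refinement inside the diffusion backbone.

```python
# Hypothetical sketch of region-specific audio injection, assuming per-identity
# soft masks (e.g., from a mask predictor) and per-identity audio embeddings.
import torch


def inject_local_audio(video_feats: torch.Tensor,
                       audio_embeds: torch.Tensor,
                       masks: torch.Tensor) -> torch.Tensor:
    """Adds each identity's audio embedding only inside its own region.

    video_feats:  (B, L, d)  flattened spatiotemporal video features
    audio_embeds: (B, N, d)  one region-specific audio embedding per identity
    masks:        (B, L, N)  soft layout masks (location -> identity assignment)
    """
    # Broadcast each audio embedding over the locations its mask covers,
    # so audio from one identity cannot leak into another identity's region.
    local_audio = torch.einsum("bln,bnd->bld", masks, audio_embeds)
    return video_feats + local_audio


# Usage: 1024 spatiotemporal locations, 2 identities, 256-dim features.
feats = inject_local_audio(torch.randn(1, 1024, 256),
                           torch.randn(1, 2, 256),
                           torch.rand(1, 1024, 2))
print(feats.shape)  # torch.Size([1, 1024, 256])
```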