InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions

📅 2025-06-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods predominantly rely on a single-subject assumption, injecting text/image/audio conditions globally, and thereby fail to support fine-grained spatiotemporal control in multi-person or human-object interaction scenarios. This work proposes the first region-level multimodal condition-binding framework, consisting of: (1) identity-aware layout inference via appearance-guided mask prediction; (2) learnable region-specific audio embeddings coupled with iterative layout alignment to ensure cross-modal spatiotemporal consistency; and (3) a diffusion-based video generation architecture that consumes these region-level conditions. The approach overcomes the single-subject limitation, enabling precise per-region conditioning, and significantly improves controllability and visual fidelity in multi-character interaction, human-object collaboration, and speech-driven motion synchronization. Quantitative and qualitative evaluations demonstrate consistent superiority over state-of-the-art methods across multiple metrics.
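
The core idea of region-level audio binding can be pictured as mask-gated cross-attention: each identity's audio stream attends to the video tokens, but its contribution is gated by that identity's predicted layout mask. The sketch below is a minimal illustration, not the paper's actual architecture; all names, tensor shapes, and the projection callables (`to_q`, `to_k`, `to_v`) are assumptions.

```python
import torch

def masked_audio_cross_attention(video_tokens, audio_tokens, masks,
                                 to_q, to_k, to_v, scale):
    # video_tokens: (B, N, C)    flattened spatiotemporal video latents
    # audio_tokens: (B, P, M, C) per-identity audio embeddings (P identities)
    # masks:        (B, P, N)    soft layout masks per identity, values in [0, 1]
    q = to_q(video_tokens)                                   # (B, N, C)
    out = torch.zeros_like(video_tokens)
    for p in range(audio_tokens.shape[1]):
        k = to_k(audio_tokens[:, p])                         # (B, M, C)
        v = to_v(audio_tokens[:, p])                         # (B, M, C)
        attn = torch.softmax(q @ k.transpose(-1, -2) * scale, dim=-1)  # (B, N, M)
        # Gate this identity's audio signal by its predicted layout mask, so the
        # condition only influences that identity's spatiotemporal footprint.
        out = out + masks[:, p].unsqueeze(-1) * (attn @ v)   # (B, N, C)
    return video_tokens + out
```

One appeal of gating the attention output with soft masks, rather than hard-cropping tokens per region, is that overlapping or uncertain regions degrade gracefully instead of cutting a condition off abruptly at a mask boundary.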

📝 Abstract
End-to-end human animation with rich multi-modal conditions, e.g., text, image, and audio, has achieved remarkable advancements in recent years. However, most existing methods can only animate a single subject and inject conditions in a global manner, ignoring scenarios in which multiple concepts appear in the same video with rich human-human and human-object interactions. Such a global assumption prevents precise, per-identity control of multiple concepts, including humans and objects, and therefore hinders applications. In this work, we discard the single-entity assumption and introduce a novel framework that enforces strong, region-specific binding of conditions from modalities to each identity's spatiotemporal footprint. Given reference images of multiple concepts, our method can automatically infer layout information by leveraging a mask predictor to match appearance cues between the denoised video and each reference appearance. Furthermore, we inject each local audio condition into its corresponding region to ensure layout-aligned modality matching in an iterative manner. This design enables the high-quality generation of controllable multi-concept human-centric videos. Empirical results and ablation studies validate the effectiveness of our explicit layout control for multi-modal conditions compared to implicit counterparts and other existing methods.
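
A rough picture of the appearance-guided layout inference described above: compare features of the partially denoised video against an appearance embedding of each reference concept, and turn the similarities into soft per-identity masks. This is a hedged sketch under assumed shapes and feature extractors, not the paper's mask predictor.

```python
import torch
import torch.nn.functional as F

def infer_identity_masks(video_feats, ref_feats, temperature=0.07):
    # video_feats: (B, N, C) features of the partially denoised video latents
    # ref_feats:   (B, P, C) one appearance embedding per reference concept
    # Returns soft masks of shape (B, P, N): each spatiotemporal token is
    # softly assigned to the reference identity it most resembles.
    v = F.normalize(video_feats, dim=-1)
    r = F.normalize(ref_feats, dim=-1)
    # Cosine similarity between every video token and every reference concept.
    sim = torch.einsum('bnc,bpc->bpn', v, r) / temperature
    # Softmax over concepts: competing identities divide up the frame.
    return sim.softmax(dim=1)
```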
Problem

Research questions and friction points this paper is trying to address.

Animate multiple humans and objects with precise control
Align audio conditions to specific regions in video
Generate multi-concept human videos with layout matching
Innovation

Methods, ideas, or system contributions that make the work stand out.

Region-specific binding of multi-modal conditions
Automatic layout inference using mask predictor
Iterative layout-aligned audio condition injection
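
The iterative injection idea amounts to alternating mask prediction and mask-conditioned denoising across sampling steps: as the video estimate gets cleaner, the layout masks improve, which in turn sharpens where each audio condition is applied. The loop below is a speculative outline; `denoiser`, `mask_predictor`, and their signatures are placeholders rather than the paper's interfaces.

```python
def iterative_layout_aligned_sampling(latents, ref_feats, audio_tokens,
                                      denoiser, mask_predictor, timesteps):
    for t in timesteps:
        # (a) Re-estimate per-identity layout masks from the current estimate.
        masks = mask_predictor(latents, ref_feats)
        # (b) One denoising step with each audio stream bound to its mask.
        latents = denoiser(latents, t, audio_tokens, masks)
    return latents
```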