🤖 AI Summary
Existing audio-driven character animation methods perform well in speech/singing scenarios but struggle with cinematic production requirements—particularly complex character interactions, realistic full-body motion, and dynamic camera modeling. To address these limitations, we propose Wan-S2V, an end-to-end audio-driven generation framework built upon the Wan architecture. It integrates high-fidelity 3D pose estimation with a learnable dynamic camera control module. Our model significantly improves long-video generation stability, lip-sync accuracy (±2 frames), and visual fidelity. Quantitatively and qualitatively, it outperforms state-of-the-art methods—including Hunyuan-Avatar and Omnihuman—across expressiveness, visual quality, and audio-visual alignment. Moreover, Wan-S2V enables controllable synthesis and fine-grained editing of cinematic-grade virtual character videos, supporting production-ready workflows.
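To make the core conditioning idea concrete, below is a minimal, hypothetical PyTorch sketch of audio-driven generation: audio features injected into video latents through cross-attention. All module names, dimensions, and shapes are illustrative assumptions for exposition, not Wan-S2V's actual architecture or API.

```python
# Hypothetical sketch: conditioning video latents on audio features via
# cross-attention. Names and shapes are assumptions, not Wan-S2V's API.
import torch
import torch.nn as nn

class AudioCrossAttention(nn.Module):
    """Injects audio features into video latents with residual cross-attention."""

    def __init__(self, latent_dim: int = 64, audio_dim: int = 32, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=latent_dim, num_heads=heads,
            kdim=audio_dim, vdim=audio_dim, batch_first=True,
        )

    def forward(self, video_latents: torch.Tensor, audio_feats: torch.Tensor) -> torch.Tensor:
        # video_latents: (batch, video_tokens, latent_dim) -- queries
        # audio_feats:   (batch, audio_frames, audio_dim)  -- keys/values
        attended, _ = self.attn(video_latents, audio_feats, audio_feats)
        return video_latents + attended  # residual conditioning

if __name__ == "__main__":
    block = AudioCrossAttention()
    latents = torch.randn(2, 16, 64)    # 16 spatio-temporal latent tokens
    audio = torch.randn(2, 50, 32)      # 50 frames of audio features
    print(block(latents, audio).shape)  # torch.Size([2, 16, 64])
```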
📝 Abstract
Current state-of-the-art (SOTA) methods for audio-driven character animation demonstrate promising performance in scenarios primarily involving speech and singing. However, they often fall short in more complex film and television productions, which demand sophisticated elements such as nuanced character interactions, realistic body movements, and dynamic camera work. To address this long-standing challenge of achieving film-level character animation, we propose an audio-driven model, which we refer to as Wan-S2V, built upon Wan. Our model achieves significantly enhanced expressiveness and fidelity in cinematic contexts compared to existing approaches. We conducted extensive experiments, benchmarking our method against cutting-edge models such as Hunyuan-Avatar and Omnihuman. The experimental results consistently demonstrate that our approach significantly outperforms these existing solutions. Additionally, we explore the versatility of our method through its applications in long-form video generation and precise video lip-sync editing.
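As one illustration of the long-form setting, the sketch below shows a common strategy for extending audio-driven generation beyond a single clip: generating chunks sequentially and carrying over a few tail frames as motion context. The function names and the overlap scheme are assumptions for exposition, not the method described in the paper.

```python
# Hypothetical sketch of chunked long-video generation with motion-frame
# carry-over; a common strategy, not necessarily Wan-S2V's exact scheme.
import torch

def generate_long_video(generate_chunk, audio_feats: torch.Tensor,
                        chunk_len: int = 48, overlap: int = 8) -> torch.Tensor:
    """Generate a long clip chunk by chunk, conditioning each chunk on the
    last `overlap` frames of the previous one to keep motion coherent."""
    frames, prev_tail = [], None
    for start in range(0, audio_feats.shape[0], chunk_len - overlap):
        audio_chunk = audio_feats[start:start + chunk_len]
        chunk = generate_chunk(audio_chunk, prev_tail)  # (T, C, H, W)
        # Skip the frames that overlap with what the previous chunk emitted.
        frames.append(chunk if prev_tail is None else chunk[overlap:])
        prev_tail = chunk[-overlap:]
    return torch.cat(frames, dim=0)

if __name__ == "__main__":
    def dummy_chunk(audio, tail):              # stand-in for a real generator
        return torch.randn(audio.shape[0], 3, 64, 64)
    audio = torch.randn(200, 32)               # 200 frames of audio features
    print(generate_long_video(dummy_chunk, audio).shape)  # (200, 3, 64, 64)
```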