🤖 AI Summary
This work addresses persistent challenges in audio-driven video generation, particularly temporal coherence in long sequences, identity preservation, and generalization across complex scenes, without introducing a new architecture. By upgrading the audio encoder to Whisper Large, scaling its training recipes, applying reinforcement learning from human feedback (RLHF), and using step distillation, the proposed method improves lip-sync accuracy, full-body motion consistency, and cross-domain generalization. Built on large-scale, high-quality data curation, the system is reported to match or surpass leading closed-source solutions such as HeyGen, OmniHuman 1.5, and Kling Avatar 2.0 in human evaluations and quantitative metrics on a benchmark of over 500 diverse test cases, while running inference in only 8 NFEs (number of function evaluations) with a favorable trade-off between serving efficiency and visual fidelity.
📝 Abstract
Despite advances in audio-driven video generation, achieving commercial-grade stability remains challenging. We present LongCat-Video-Avatar 1.5, an upgraded open-source framework that prioritizes systematic engineering and production-readiness over architectural novelty. By upgrading the audio encoder to Whisper Large and meticulously scaling our training recipes, v1.5 achieves accurate lip synchronization, full-body temporal stability, and robust long-video generation with strict identity consistency. Through rigorous data curation and RLHF training, the model generalizes to stylized domains such as anime and animals, and natively handles complex real-world conditions, including multi-person interactions and object handling. Furthermore, to meet the practical demands of industrial deployment, we employ step distillation to reduce inference to 8 NFEs (number of function evaluations), achieving a favorable trade-off between serving efficiency and visual fidelity. We validate our approach through extensive quantitative metrics and a rigorous human evaluation conducted on a comprehensive benchmark of over 500 diverse test cases. Results show that v1.5 achieves competitive or superior performance compared to leading closed-source systems (e.g., HeyGen, OmniHuman 1.5, Kling Avatar 2.0) across human-likeness ratings and expert-level quality assessments on our benchmark. With its open-source release, LongCat-Video-Avatar 1.5 narrows the gap between academic research prototypes and commercial-grade deployment.