Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model

📅 2026-03-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing multimodal audio-visual generation models—architectural complexity, low inference efficiency, and poor audio-visual synchronization—particularly in human-centric scenarios. To overcome these challenges, the authors propose daVinci-MagiHuman, an end-to-end generative framework built on a single-stream Transformer. The approach unifies text, video, and audio into a shared token sequence and relies on pure self-attention, eliminating multi-stream and cross-attention designs to significantly simplify the architecture. By combining model distillation, latent-space super-resolution, and Turbo VAE decoding, the method achieves highly efficient inference, generating a 5-second 256p video in 2 seconds on a single H100 GPU. Experiments demonstrate state-of-the-art visual quality and text alignment among leading open models, along with the lowest word error rate (14.60%) for speech intelligibility. In pairwise human evaluation, its outputs are preferred 80.0% of the time against Ovi 1.1 and 60.9% against LTX 2.3, and the model supports speech generation in six languages.

📝 Abstract
We present daVinci-MagiHuman, an open-source audio-video generative foundation model for human-centric generation. daVinci-MagiHuman jointly generates synchronized video and audio using a single-stream Transformer that processes text, video, and audio within a unified token sequence via self-attention only. This single-stream design avoids the complexity of multi-stream or cross-attention architectures while remaining easy to optimize with standard training and inference infrastructure. The model is particularly strong in human-centric scenarios, producing expressive facial performance, natural speech-expression coordination, realistic body motion, and precise audio-video synchronization. It supports multilingual spoken generation across Chinese (Mandarin and Cantonese), English, Japanese, Korean, German, and French. For efficient inference, we combine the single-stream backbone with model distillation, latent-space super-resolution, and a Turbo VAE decoder, enabling generation of a 5-second 256p video in 2 seconds on a single H100 GPU. In automatic evaluation, daVinci-MagiHuman achieves the highest visual quality and text alignment among leading open models, along with the lowest word error rate (14.60%) for speech intelligibility. In pairwise human evaluation, it achieves win rates of 80.0% against Ovi 1.1 and 60.9% against LTX 2.3 over 2000 comparisons. We open-source the complete model stack, including the base model, the distilled model, the super-resolution model, and the inference codebase.
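The single-stream idea described in the abstract—concatenating text, video, and audio tokens into one sequence and running plain self-attention over it, with no per-modality branches or cross-attention blocks—can be illustrated with a toy NumPy sketch. All dimensions, token counts, and the single-head formulation here are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy embedding dimension (hypothetical; the real model is far larger)

# Hypothetical per-modality tokens, already projected into a shared space.
text_tokens = rng.normal(size=(8, d))    # e.g. prompt tokens
video_tokens = rng.normal(size=(32, d))  # e.g. patchified video latents
audio_tokens = rng.normal(size=(12, d))  # e.g. audio latents

# Single-stream design: one unified token sequence for all modalities...
x = np.concatenate([text_tokens, video_tokens, audio_tokens], axis=0)

# ...processed by ordinary self-attention (one toy head shown), so every
# token can attend to every modality in a single pass -- no multi-stream
# towers and no cross-attention layers.
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = q @ k.T / np.sqrt(d)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
out = attn @ v

print(out.shape)  # (52, 16): 8 text + 32 video + 12 audio tokens, jointly attended
```

Because the joint sequence is handled by a standard Transformer block, the design stays compatible with off-the-shelf training and inference infrastructure, which is the simplification the abstract emphasizes.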
Problem

Research questions and friction points this paper is trying to address.

audio-video generation
generative foundation model
human-centric synthesis
multilingual speech
synchronization
Innovation

Methods, ideas, or system contributions that make the work stand out.

single-stream architecture
audio-video generative model
self-attention only
model distillation
latent-space super-resolution