DreamFoley: Scalable VLMs for High-Fidelity Video-to-Audio Generation

📅 2025-12-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing video generation methods improve visual fidelity but commonly lack high-quality, content-synchronized audio, severely limiting immersive experience and practical deployment. To address this, we propose the first autoregressive video-to-audio generation framework grounded in large vision-language models (VLMs), which jointly model the video, audio, and text modalities. Our method employs a dual visual encoder to capture both audio-aligned and text-aligned visual features; integrates a residual vector quantization (RVQ)-based audio tokenizer with a delay-pattern generation scheme to balance training efficiency against audio quality; and pioneers the adaptation of classifier-free guidance (CFG) to VLM-based audio generation, significantly improving controllability and audio quality. We achieve state-of-the-art performance across multiple benchmarks, establish an end-to-end audio-video-text data production pipeline, and publicly release precisely aligned audio-visual textual descriptions previously missing from a public benchmark, enabling fair evaluation and advancing multimodal generative modeling research.
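As a concrete illustration of how CFG carries over to autoregressive token decoding, the sketch below blends conditional and unconditional next-token logits. The paper's exact formulation, guidance scale, and conditioning-dropout scheme are not given on this page, so everything here (function names, the scale of 3.0, the null-embedding convention) is an assumption, not the authors' API.

```python
import torch

def cfg_logits(cond_logits: torch.Tensor,
               uncond_logits: torch.Tensor,
               guidance_scale: float = 3.0) -> torch.Tensor:
    """Classifier-free guidance on next-token logits.

    Each decoding step runs two forward passes: one conditioned on the
    video/text features and one with the conditioning dropped (e.g.
    replaced by a learned null embedding). The guided logits
    extrapolate away from the unconditional distribution.
    """
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)

# Illustrative decoding step (model and feature names are hypothetical):
#   cond   = model(audio_tokens, video_feats, text_feats)
#   uncond = model(audio_tokens, null_feats, null_feats)
#   next_token = cfg_logits(cond, uncond).argmax(dim=-1)
```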

📝 Abstract
Recent advances in video generation have achieved remarkable improvements in visual content fidelity. However, the absence of synchronized audio severely undermines the immersive experience and restricts practical applications of these technologies. To address this challenge, several pioneering works have explored diffusion transformer architectures for generating plausible video-synchronized audio, including Kling-Foley, HunyuanVideo-Foley, and ThinkSound. Distinct from existing works, we introduce an autoregressive audio generation architecture (DreamFoley) that harnesses the capabilities of large vision-language models (VLMs) to jointly model sequential interactions among the video, audio, and text modalities. Our approach features a dual-visual-encoder module that effectively captures both audio-aligned and text-aligned visual features. Additionally, we employ a Residual Vector Quantization audio tokenizer with a delay-pattern generation scheme to balance the trade-off between training efficiency and audio quality. Moreover, we introduce a classifier-free guidance strategy into VLMs to boost the quality of the generated audio. Furthermore, we establish an efficient data production pipeline to scale audio-video-text triplet collection. Finally, extensive experiments validate the effectiveness of our model, which achieves promising performance across popular benchmarks. We hope the findings of this study provide a strong foundation for future video-to-audio generation research. We also release the previously missing audio-visual textual descriptions for the public benchmark, enabling subsequent researchers to conduct more convenient and effective evaluations and comparisons.
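The delay-pattern scheme named in the abstract is not detailed on this page; below is a minimal sketch in the style of MusicGen-like delay patterns, assuming a (K, T) grid of RVQ codes in which codebook k is shifted right by k steps so that a single autoregressive pass can emit all K codebooks per step. Function and argument names are illustrative.

```python
import torch

def apply_delay_pattern(codes: torch.Tensor, pad_id: int) -> torch.Tensor:
    """Shift codebook k of an RVQ token grid right by k steps.

    codes: (K, T) integer grid, K = number of RVQ codebooks, T = frames.
    Returns a (K, T + K - 1) grid padded with pad_id, in which a frame's
    coarse codes are emitted before its fine codes, so one autoregressive
    pass can predict all K codebooks in parallel at each step.
    """
    K, T = codes.shape
    delayed = torch.full((K, T + K - 1), pad_id, dtype=codes.dtype)
    for k in range(K):
        delayed[k, k:k + T] = codes[k]
    return delayed

# Inverting the pattern after generation just undoes the shifts:
#   codes[k] = delayed[k, k:k + T]
```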
Problem

Research questions and friction points this paper is trying to address.

Generates synchronized audio for videos using autoregressive VLMs
Improves audio quality via a dual-visual encoder and an RVQ audio tokenizer (see the sketch after this list)
Scales data collection and evaluation for video-to-audio generation
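For intuition on the dual-visual encoder referenced above, here is a minimal sketch assuming two visual branches, one producing audio-aligned features and one producing text-aligned semantic features, projected into a shared VLM embedding space. The class, dimensions, and fusion-by-concatenation choice are hypothetical, not the paper's design.

```python
import torch
import torch.nn as nn

class DualVisualEncoder(nn.Module):
    """Hypothetical dual-branch visual front-end.

    One branch is assumed to yield audio-aligned features (motion,
    onsets); the other, text-aligned semantic features (e.g. from a
    CLIP-style encoder). Both are projected into the VLM embedding
    space and concatenated along the sequence axis.
    """

    def __init__(self, d_audio_vis: int, d_text_vis: int, d_model: int):
        super().__init__()
        self.proj_audio_vis = nn.Linear(d_audio_vis, d_model)
        self.proj_text_vis = nn.Linear(d_text_vis, d_model)

    def forward(self, audio_vis: torch.Tensor, text_vis: torch.Tensor):
        # audio_vis: (B, Ta, d_audio_vis), text_vis: (B, Tt, d_text_vis)
        a = self.proj_audio_vis(audio_vis)
        t = self.proj_text_vis(text_vis)
        return torch.cat([a, t], dim=1)  # (B, Ta + Tt, d_model)
```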
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive audio generation with large vision-language models
Dual-visual encoder captures audio and text features
Residual Vector Quantization tokenizer balances efficiency and quality (see the sketch below)
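A minimal sketch of greedy residual vector quantization, the mechanism behind the RVQ tokenizer named above. Codebook sizes, the number of stages, and training details (EMA codebook updates, commitment losses) are omitted, and all names are illustrative rather than the paper's implementation.

```python
import torch

def rvq_encode(x: torch.Tensor, codebooks: list[torch.Tensor]) -> torch.Tensor:
    """Greedy residual vector quantization.

    x: (T, D) latent audio frames.
    codebooks: list of (V, D) tensors, ordered coarse to fine.
    Each stage quantizes the residual left by the previous stage, so
    later codebooks capture progressively finer detail.
    """
    residual = x
    indices = []
    for cb in codebooks:
        dists = torch.cdist(residual, cb)   # (T, V) pairwise distances
        idx = dists.argmin(dim=-1)          # nearest code per frame
        indices.append(idx)
        residual = residual - cb[idx]       # pass the residual onward
    return torch.stack(indices)             # (K, T) token grid

# Decoding sums the selected code vectors across stages:
#   x_hat = sum(cb[idx] for cb, idx in zip(codebooks, rvq_encode(x, codebooks)))
```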
Fu Li
Bytedance Intelligent Creation Lab
Weichao Zhao
Bytedance Intelligent Creation Lab
You Li
Bytedance Intelligent Creation Lab, Zhejiang University
Zhichao Zhou
ShanghaiTech University
Dongliang He
ByteDance Inc.

Computer Vision · Deep Learning · Multimedia