DreamFoley: Scalable VLMs for High-Fidelity Video-to-Audio Generation

📅 2025-12-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing video generation methods improve visual fidelity but commonly lack high-quality, content-synchronized audio, severely limiting immersive experience and practical deployment. To address this, we propose the first autoregressive video-to-audio generation framework grounded in large vision-language models (VLMs), which jointly model the video, audio, and text modalities. Our method employs a dual visual encoder to capture both audio-aligned and text-aligned visual features; integrates a residual vector quantization (RVQ)-based audio tokenizer with a delay-pattern generation scheme to balance training efficiency against audio quality; and pioneers the adaptation of classifier-free guidance (CFG) to VLM-based audio generation, significantly improving controllability and audio quality. We achieve state-of-the-art performance across multiple benchmarks, establish an end-to-end audio-video-text data production pipeline, and publicly release precisely aligned audio-visual textual descriptions previously missing from a public benchmark, enabling fair evaluation and advancing multimodal generative modeling research.
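As a concrete illustration of how CFG carries over to autoregressive token decoding, the sketch below blends conditional and unconditional next-token logits. The paper's exact formulation, guidance scale, and conditioning-dropout scheme are not given on this page, so everything here (function names, the scale of 3.0, the null-embedding convention) is an assumption, not the authors' API.

```python
import torch

def cfg_logits(cond_logits: torch.Tensor,
               uncond_logits: torch.Tensor,
               guidance_scale: float = 3.0) -> torch.Tensor:
    """Classifier-free guidance on next-token logits.

    Each decoding step runs two forward passes: one conditioned on the
    video/text features and one with the conditioning dropped (e.g.
    replaced by a learned null embedding). The guided logits
    extrapolate away from the unconditional distribution.
    """
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)

# Illustrative decoding step (model and feature names are hypothetical):
#   cond   = model(audio_tokens, video_feats, text_feats)
#   uncond = model(audio_tokens, null_feats, null_feats)
#   next_token = cfg_logits(cond, uncond).argmax(dim=-1)
```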

📝 Abstract
Recent advances in video generation have achieved remarkable improvements in visual content fidelity. However, the absence of synchronized audio severely undermines the immersive experience and restricts practical applications of these technologies. To address this challenge, several pioneering works have explored diffusion transformer architectures for generating plausible video-synchronized audio, including Kling-Foley, HunyuanVideo-Foley, and ThinkSound. Distinct from existing works, we introduce an autoregressive audio generation architecture (DreamFoley) that harnesses the capabilities of large vision-language models (VLMs) to jointly model sequential interactions among the video, audio, and text modalities. Our approach features a dual-visual-encoder module that effectively captures both audio-aligned and text-aligned visual features. Additionally, we employ a Residual Vector Quantization audio tokenizer with a delay-pattern generation scheme to balance the trade-off between training efficiency and audio quality. Moreover, we introduce a classifier-free guidance strategy into VLMs to boost the quality of the generated audio. Furthermore, we establish an efficient data production pipeline to scale audio-video-text triplet collection. Finally, extensive experiments validate the effectiveness of our model, which achieves promising performance across popular benchmarks. We hope the findings of this study provide a strong foundation for future video-to-audio generation research. We also release the previously missing audio-visual textual descriptions for the public benchmark, enabling subsequent researchers to conduct more convenient and effective evaluations and comparisons.
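The delay-pattern scheme named in the abstract is not detailed on this page; below is a minimal sketch in the style of MusicGen-like delay patterns, assuming a (K, T) grid of RVQ codes in which codebook k is shifted right by k steps so that a single autoregressive pass can emit all K codebooks per step. Function and argument names are illustrative.

```python
import torch

def apply_delay_pattern(codes: torch.Tensor, pad_id: int) -> torch.Tensor:
    """Shift codebook k of an RVQ token grid right by k steps.

    codes: (K, T) integer grid, K = number of RVQ codebooks, T = frames.
    Returns a (K, T + K - 1) grid padded with pad_id, in which a frame's
    coarse codes are emitted before its fine codes, so one autoregressive
    pass can predict all K codebooks in parallel at each step.
    """
    K, T = codes.shape
    delayed = torch.full((K, T + K - 1), pad_id, dtype=codes.dtype)
    for k in range(K):
        delayed[k, k:k + T] = codes[k]
    return delayed

# Inverting the pattern after generation just undoes the shifts:
#   codes[k] = delayed[k, k:k + T]
```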
Problem

Research questions and friction points this paper is trying to address.

Generates synchronized audio for videos using autoregressive VLMs
Improves audio quality via a dual-visual encoder and an RVQ audio tokenizer (see the sketch after this list)
Scales data collection and evaluation for video-to-audio generation
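For intuition on the dual-visual encoder referenced above, here is a minimal sketch assuming two visual branches, one producing audio-aligned features and one producing text-aligned semantic features, projected into a shared VLM embedding space. The class, dimensions, and fusion-by-concatenation choice are hypothetical, not the paper's design.

```python
import torch
import torch.nn as nn

class DualVisualEncoder(nn.Module):
    """Hypothetical dual-branch visual front-end.

    One branch is assumed to yield audio-aligned features (motion,
    onsets); the other, text-aligned semantic features (e.g. from a
    CLIP-style encoder). Both are projected into the VLM embedding
    space and concatenated along the sequence axis.
    """

    def __init__(self, d_audio_vis: int, d_text_vis: int, d_model: int):
        super().__init__()
        self.proj_audio_vis = nn.Linear(d_audio_vis, d_model)
        self.proj_text_vis = nn.Linear(d_text_vis, d_model)

    def forward(self, audio_vis: torch.Tensor, text_vis: torch.Tensor):
        # audio_vis: (B, Ta, d_audio_vis), text_vis: (B, Tt, d_text_vis)
        a = self.proj_audio_vis(audio_vis)
        t = self.proj_text_vis(text_vis)
        return torch.cat([a, t], dim=1)  # (B, Ta + Tt, d_model)
```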
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive audio generation with large vision-language models
Dual-visual encoder captures audio and text features
Residual Vector Quantization tokenizer balances efficiency and quality (see the sketch below)
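A minimal sketch of greedy residual vector quantization, the mechanism behind the RVQ tokenizer named above. Codebook sizes, the number of stages, and training details (EMA codebook updates, commitment losses) are omitted, and all names are illustrative rather than the paper's implementation.

```python
import torch

def rvq_encode(x: torch.Tensor, codebooks: list[torch.Tensor]) -> torch.Tensor:
    """Greedy residual vector quantization.

    x: (T, D) latent audio frames.
    codebooks: list of (V, D) tensors, ordered coarse to fine.
    Each stage quantizes the residual left by the previous stage, so
    later codebooks capture progressively finer detail.
    """
    residual = x
    indices = []
    for cb in codebooks:
        dists = torch.cdist(residual, cb)   # (T, V) pairwise distances
        idx = dists.argmin(dim=-1)          # nearest code per frame
        indices.append(idx)
        residual = residual - cb[idx]       # pass the residual onward
    return torch.stack(indices)             # (K, T) token grid

# Decoding sums the selected code vectors across stages:
#   x_hat = sum(cb[idx] for cb, idx in zip(codebooks, rvq_encode(x, codebooks)))
```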
Fu Li
Bytedance Intelligent Creation Lab
Weichao Zhao
Bytedance Intelligent Creation Lab
You Li
Bytedance Intelligent Creation Lab, Zhejiang University
Zhichao Zhou
ShanghaiTech University
Dongliang He
ByteDance Inc.

Computer Vision · Deep Learning · Multimedia