🤖 AI Summary
This work addresses three limitations of existing video-to-audio (V2A) generation methods: weak text controllability when the visual and textual conditions conflict, imprecise style control caused by entangled temporal and timbral information in the reference audio, and the absence of standardized evaluation benchmarks. To overcome these, the authors propose ControlFoley, a unified multimodal V2A framework that improves cross-modal alignment through joint visual encoding (CLIP combined with a spatio-temporal audio-visual encoder), introduces temporal-timbre decoupling for fine-grained style control, and adopts a modality-robust training strategy that combines unified multimodal representation alignment (REPA) with random modality dropout. They also construct VGGSound-TVC, the first benchmark specifically designed for evaluating textual controllability under visual-text conflict. Experiments show state-of-the-art results across text-guided, text-controlled, and audio-controlled generation, with superior controllability under conflicting conditions, strong audio-visual synchronization and audio quality, and performance competitive with or better than an industrial V2A system.
📝 Abstract
Recent advances in video-to-audio (V2A) generation enable high-quality audio synthesis from visual content, yet achieving robust and fine-grained controllability remains challenging. Existing methods suffer from weak textual controllability under visual-text conflict and imprecise stylistic control due to entangled temporal and timbre information in reference audio. Moreover, the lack of standardized benchmarks limits systematic evaluation.
We propose ControlFoley, a unified multimodal V2A framework that enables precise control through video, text, and reference-audio conditions. We introduce a joint visual encoding paradigm that integrates CLIP with a spatio-temporal audio-visual encoder to improve alignment and textual controllability. We further propose temporal-timbre decoupling, which suppresses redundant temporal cues in the reference audio while preserving discriminative timbre features. In addition, we design a modality-robust training scheme with unified multimodal representation alignment (REPA) and random modality dropout. We also present VGGSound-TVC, a benchmark for evaluating textual controllability under varying degrees of visual-text conflict.
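As a rough illustration of the random modality dropout mentioned above, the following minimal sketch masks each conditioning input independently during training. It is not the authors' implementation: the function name `drop_modalities`, the `drop_prob` value, the use of `None` as a stand-in for a "null" embedding, and the placeholder feature strings are all assumptions made for illustration.

```python
import random


def drop_modalities(conditions, drop_prob=0.2, rng=random):
    """Return a copy of `conditions` with each entry independently masked.

    A masked entry is set to None, standing in for whatever "null"
    embedding a real model would substitute (an assumption of this sketch).
    """
    if not 0.0 <= drop_prob <= 1.0:
        raise ValueError("drop_prob must be in [0, 1]")
    return {
        name: None if rng.random() < drop_prob else value
        for name, value in conditions.items()
    }


# Hypothetical conditioning inputs for one training sample.
sample = {
    "video": "video_features",
    "text": "caption_features",
    "ref_audio": "audio_features",
}

# A seeded generator keeps the masking reproducible across runs.
rng = random.Random(0)
masked = drop_modalities(sample, drop_prob=0.5, rng=rng)
print(masked)
```

Training on such partially masked inputs is one common way to keep a model usable when only a subset of conditions is supplied at inference time; whether the paper masks modalities independently or jointly is not stated in the abstract.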
Extensive experiments demonstrate state-of-the-art performance across multiple V2A tasks, including text-guided, text-controlled, and audio-controlled generation. ControlFoley achieves superior controllability under cross-modal conflict while maintaining strong synchronization and audio quality, and shows competitive or better performance compared to an industrial V2A system.
Code, models, datasets, and demos are available at: https://yjx-research.github.io/ControlFoley/.