ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling

📅 2026-04-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses three limitations of existing video-to-audio (V2A) generation methods: weak text controllability when the text prompt conflicts with the visual content, imprecise style control because temporal and timbral information are entangled in the reference audio, and the absence of standardized evaluation benchmarks. The authors propose ControlFoley, a unified multimodal V2A framework conditioned on video, text, and reference audio. It improves cross-modal alignment and textual controllability through a joint visual encoding scheme that combines CLIP with a spatio-temporal audio-visual encoder, introduces temporal-timbre decoupling for fine-grained style control, and uses a modality-robust training strategy with unified multimodal representation alignment (REPA) and random modality dropout. They also construct VGGSound-TVC, the first benchmark designed for evaluating textual controllability under varying degrees of visual-text conflict. Experiments show state-of-the-art results on text-guided, text-controlled, and audio-controlled generation. The method retains strong controllability, audio-visual synchronization, and audio quality under conflicting conditions, and performs competitively with or better than an industrial V2A system.

📝 Abstract

Recent advances in video-to-audio (V2A) generation enable high-quality audio synthesis from visual content, yet achieving robust and fine-grained controllability remains challenging. Existing methods suffer from weak textual controllability under visual-text conflict and imprecise stylistic control due to entangled temporal and timbre information in reference audio. Moreover, the lack of standardized benchmarks limits systematic evaluation. We propose ControlFoley, a unified multimodal V2A framework that enables precise control over video, text, and reference audio. We introduce a joint visual encoding paradigm that integrates CLIP with a spatio-temporal audio-visual encoder to improve alignment and textual controllability. We further propose temporal-timbre decoupling to suppress redundant temporal cues while preserving discriminative timbre features. In addition, we design a modality-robust training scheme with unified multimodal representation alignment (REPA) and random modality dropout. We also present VGGSound-TVC, a benchmark for evaluating textual controllability under varying degrees of visual-text conflict. Extensive experiments demonstrate state-of-the-art performance across multiple V2A tasks, including text-guided, text-controlled, and audio-controlled generation. ControlFoley achieves superior controllability under cross-modal conflict while maintaining strong synchronization and audio quality, and shows competitive or better performance compared to an industrial V2A system. Code, models, datasets, and demos are available at: https://yjx-research.github.io/ControlFoley/.
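
The abstract names random modality dropout as part of the training scheme but does not describe how it is done. The sketch below illustrates the general idea only and is not the authors' implementation. The function name `random_modality_dropout`, the modality keys, the independent per-modality drop rule, and the use of `None` in place of a learned null embedding are all assumptions made for illustration.

```python
import random
from typing import Dict, Optional, Sequence

Features = Optional[Sequence[float]]


def random_modality_dropout(
    conditions: Dict[str, Features],
    drop_prob: float = 0.3,
    rng: Optional[random.Random] = None,
) -> Dict[str, Features]:
    """Return a copy of `conditions` with some modalities masked out.

    Each modality is dropped independently with probability `drop_prob`.
    A dropped modality is set to None, which stands in for the "null"
    condition a real model would substitute (hypothetical placeholder).
    """
    rng = rng or random.Random()
    masked: Dict[str, Features] = {}
    for name, features in conditions.items():
        keep = features is not None and rng.random() >= drop_prob
        masked[name] = features if keep else None
    return masked


# Toy conditioning inputs; real features would be encoder outputs.
example = {
    "video": [0.1, 0.2, 0.3],
    "text": [0.4, 0.5],
    "ref_audio": [0.6, 0.7],
}
print(random_modality_dropout(example, drop_prob=0.5, rng=random.Random(0)))
```

Training on randomly masked subsets of conditions is a common way to make a model tolerate missing inputs at inference time, for example generating from video alone or from video plus text without reference audio.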

Problem

Research questions and friction points this paper is trying to address.

video-to-audio generation

controllability

cross-modal conflict

textual control

audio style

Innovation

Methods, ideas, or system contributions that make the work stand out.

video-to-audio generation

cross-modal conflict

temporal-timbre decoupling