ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling

📅 2026-04-16

🤖 AI Summary
This work addresses three limitations of existing video-to-audio (V2A) generation methods: weak text controllability when the visual content and the text prompt conflict, imprecise style control caused by entangled temporal and timbre information in reference audio, and the lack of standardized evaluation benchmarks. The authors propose ControlFoley, a unified multimodal V2A framework conditioned on video, text, and reference audio. It uses a joint visual encoding scheme that combines CLIP with a spatio-temporal audio-visual encoder to improve alignment and text controllability, a temporal-timbre decoupling mechanism that suppresses redundant temporal cues while preserving timbre features, and a modality-robust training scheme with unified multimodal representation alignment (REPA) and random modality dropout. They also construct VGGSound-TVC, a benchmark for evaluating text controllability under varying degrees of visual-text conflict. Experiments report state-of-the-art results on text-guided, text-controlled, and audio-controlled generation, with strong controllability, audio-visual synchronization, and audio quality under conflicting conditions, and performance competitive with or better than an industrial V2A system.
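
The random modality dropout mentioned in the summary can be pictured with a minimal sketch. The function name `drop_modalities`, the `drop_prob` value, and the choice to drop each modality independently are illustrative assumptions; the source does not state how ControlFoley samples which inputs to drop.

```python
import random


def drop_modalities(conditions, drop_prob=0.3, rng=None):
    """Randomly mask conditioning inputs during training (hypothetical sketch).

    conditions: dict mapping a modality name (e.g. "video", "text",
                "ref_audio") to its feature object.
    drop_prob:  assumed per-modality dropout probability, not from the paper.
    Returns a new dict in which dropped modalities are replaced by None,
    so the model would learn to generate from any subset of inputs.
    """
    rng = rng or random.Random()
    masked = {}
    for name, feature in conditions.items():
        # Each modality is dropped independently (an assumption).
        masked[name] = None if rng.random() < drop_prob else feature
    return masked


# Usage: one training step sees a random subset of the three conditions.
batch = {"video": "v_feat", "text": "t_feat", "ref_audio": "a_feat"}
print(drop_modalities(batch, drop_prob=0.3, rng=random.Random(0)))
```

In this sketch every modality can be dropped at once, which corresponds to unconditional training; whether the authors allow that case is not stated in the source.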

📝 Abstract
Recent advances in video-to-audio (V2A) generation enable high-quality audio synthesis from visual content, yet achieving robust and fine-grained controllability remains challenging. Existing methods suffer from weak textual controllability under visual-text conflict and imprecise stylistic control due to entangled temporal and timbre information in reference audio. Moreover, the lack of standardized benchmarks limits systematic evaluation. We propose ControlFoley, a unified multimodal V2A framework that enables precise control over video, text, and reference audio. We introduce a joint visual encoding paradigm that integrates CLIP with a spatio-temporal audio-visual encoder to improve alignment and textual controllability. We further propose temporal-timbre decoupling to suppress redundant temporal cues while preserving discriminative timbre features. In addition, we design a modality-robust training scheme with unified multimodal representation alignment (REPA) and random modality dropout. We also present VGGSound-TVC, a benchmark for evaluating textual controllability under varying degrees of visual-text conflict. Extensive experiments demonstrate state-of-the-art performance across multiple V2A tasks, including text-guided, text-controlled, and audio-controlled generation. ControlFoley achieves superior controllability under cross-modal conflict while maintaining strong synchronization and audio quality, and shows competitive or better performance compared to an industrial V2A system. Code, models, datasets, and demos are available at: https://yjx-research.github.io/ControlFoley/.
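
To illustrate the idea behind temporal-timbre decoupling, here is a deliberately simple sketch that averages reference-audio features over time. Mean pooling is a stand-in chosen for illustration, not the mechanism used in ControlFoley, which the abstract does not specify; the function name `timbre_embedding` is hypothetical.

```python
def timbre_embedding(frames):
    """Collapse per-frame audio features into one time-independent vector.

    frames: list of equal-length feature vectors, one per time step.
    Averaging over time discards when events occur (temporal cues) while
    keeping an aggregate description of how the audio sounds.
    This is an illustrative assumption, not the paper's method.
    """
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    n = len(frames)
    return [sum(frame[d] for frame in frames) / n for d in range(dim)]


# Usage: reordering the frames changes the timing but not the embedding.
ref = [[1.0, 2.0], [3.0, 4.0]]
print(timbre_embedding(ref))
print(timbre_embedding(ref[::-1]))
```

Because the pooled vector is the same for any ordering of the frames, a generator conditioned on it cannot copy the reference clip's timing and has to take synchronization cues from the video instead.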
Problem

Research questions and friction points this paper is trying to address.

video-to-audio generation
controllability
cross-modal conflict
textual control
audio style
Innovation

Methods, ideas, or system contributions that make the work stand out.

video-to-audio generation
cross-modal conflict
temporal-timbre decoupling
multimodal controllability
modality-robust training
Jianxuan Yang
MiLM Plus, Xiaomi Inc.
Xinyue Guo
MiLM Plus, Xiaomi Inc.
Zhi Cheng
MiLM Plus, Xiaomi Inc., Wuhan University
Kai Wang
MiLM Plus, Xiaomi Inc., Wuhan University
Lipan Zhang
MiLM Plus, Xiaomi Inc.
Jinjie Hu
MiLM Plus, Xiaomi Inc.
Qiang Ji
MiLM Plus, Xiaomi Inc.
Yihua Cao
MiLM Plus, Xiaomi Inc.
Yihao Meng
MiLM Plus, Xiaomi Inc., Wuhan University
Zhaoyue Cui
MiLM Plus, Xiaomi Inc., Wuhan University
Mengmei Liu
MiLM Plus, Xiaomi Inc.
Meng Meng
Associate Professor, University of Bath
Sustainable transport, Network modelling and optimisation, Travel behaviour analysis
Jian Luan
Toshiba, Microsoft, Xiaomi
LLM, VLM, TTS, Singing Synthesis