Foley-Flow: Coordinated Video-to-Audio Generation with Masked Audio-Visual Alignment and Dynamic Conditional Flows

📅 2025-06-10
🏛️ Computer Vision and Pattern Recognition
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of aligning semantic content and rhythmic structure in video-to-audio generation. The authors propose a masked modeling-based audiovisual alignment mechanism that achieves cross-modal synchronization by jointly optimizing independently pretrained audio and video encoders. To further enhance temporal coherence, they introduce a dynamic conditional flow architecture that leverages time-varying visual features to dynamically guide the generation of audio segments. This design effectively balances global semantic consistency with local rhythmic alignment. Evaluated on standard benchmarks, the proposed method significantly outperforms existing approaches, achieving state-of-the-art performance across multiple objective and subjective metrics.
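The masked audiovisual alignment described above can be illustrated with a minimal numpy sketch: some audio-segment features are masked, and a decoder must recover them from the temporally corresponding video-segment features, so the reconstruction objective forces cross-modal synchronization. This is not the authors' implementation; the shapes, the linear stand-in for the decoder, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: T temporally aligned segments, d-dim features
# from independently pretrained unimodal encoders.
T, d = 8, 16
audio = rng.standard_normal((T, d))  # audio segment features
video = rng.standard_normal((T, d))  # corresponding video segment features

# Mask every other audio segment (deterministic for the sketch).
mask = np.arange(T) % 2 == 0

# Toy linear map standing in for the cross-modal decoder: predict each
# masked audio segment from the video segment at the same time step.
W = rng.standard_normal((d, d)) * 0.1
pred = video @ W

# Reconstruction loss computed only on masked positions, so recovery
# must rely on the video segments' semantics and timing.
loss = np.mean((pred[mask] - audio[mask]) ** 2)
```

In the actual method the linear map would be a trained network, and optimizing this loss jointly updates both encoders toward semantic and rhythmic consistency.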

๐Ÿ“ Abstract
Coordinated audio generation from video inputs typically requires strict audio-visual (AV) alignment, where both the semantics and the rhythm of the generated audio segments should correspond to those of the video frames. Previous studies use a two-stage design in which the AV encoders are first aligned via contrastive learning, and the encoded video representations then guide the audio generation process. We observe that both contrastive learning and global video guidance are effective at aligning overall AV semantics but limit temporal rhythmic synchronization. In this work, we propose Foley-Flow, which first aligns unimodal AV encoders via masked modeling training, where masked audio segments are recovered under the guidance of the corresponding video segments. After this training, the AV encoders, each pretrained on unimodal data alone, are aligned with semantic and rhythmic consistency. We then develop a dynamic conditional flow for the final audio generation. Built upon the efficient velocity flow generation framework, our dynamic conditional flow uses temporally varying video features as a dynamic condition to guide the generation of the corresponding audio segments. To this end, we extract coherent semantic and rhythmic representations during masked AV alignment and use these video-segment representations to guide audio generation over time. Our audio results are evaluated on standard benchmarks and largely surpass existing results on several metrics. This superior performance indicates that Foley-Flow is effective at generating coordinated audio that is both semantically and rhythmically coherent with diverse video sequences.
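The dynamic conditional flow in the abstract builds on velocity-based flow matching: the model learns the velocity of a straight-line path from noise to the audio latent, and at sampling time integrates that velocity field, with each audio segment conditioned on the video feature of the same time step rather than one global video embedding. The following is a minimal numpy sketch under that assumption; the linear velocity model, segment counts, and Euler sampler are illustrative stand-ins, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: T audio segments with d-dim latents, each paired
# with a time-varying video feature (the "dynamic condition").
T, d = 4, 8
x1 = rng.standard_normal((T, d))    # target audio segment latents
cond = rng.standard_normal((T, d))  # per-segment video features

# Flow-matching training target on the linear path
# x_t = (1 - t) * x0 + t * x1, whose velocity is x1 - x0.
x0 = rng.standard_normal((T, d))  # noise sample
t = rng.random((T, 1))            # time steps in [0, 1]
xt = (1.0 - t) * x0 + t * x1
v_target = x1 - x0

def velocity_model(xt, t, cond, W):
    """Toy linear stand-in for the conditional velocity network."""
    feats = np.concatenate([xt, cond, np.broadcast_to(t, xt.shape)], axis=-1)
    return feats @ W

W = rng.standard_normal((3 * d, d)) * 0.01
loss = np.mean((velocity_model(xt, t, cond, W) - v_target) ** 2)

# Sampling: integrate dx/dt = v(x, t, cond) with Euler steps from t=0 to 1;
# each segment is steered by its own video condition throughout.
x = rng.standard_normal((T, d))
n_steps = 10
for step in range(n_steps):
    tt = np.full((T, 1), step / n_steps)
    x = x + (1.0 / n_steps) * velocity_model(x, tt, cond, W)
```

Conditioning on per-segment video features in every integration step is what distinguishes this setup from guidance by a single global video embedding.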
Problem

Research questions and friction points this paper is trying to address.

video-to-audio generation
audio-visual alignment
rhythmic synchronization
semantic coherence
coordinated audio generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

masked audio-visual alignment
dynamic conditional flows
video-to-audio generation
rhythmic synchronization
velocity flow
Shentong Mo
CMU / MBZUAI, DAMO Academy, Alibaba Group
Yibing Song
Deputy Chief Engineer, BYD Group
Multi-Modal AI