🤖 AI Summary
This work proposes AV-CASS, the first audiovisual framework for cinematic audio source separation. It addresses the difficulty that audio-only methods have in cleanly separating speech, music, and sound effects in film audio. The approach formulates the task as a visually conditioned generation problem and uses visual context to improve separation quality. AV-CASS employs a dual-stream visual encoder and is trained on synthetically generated data that combines facial and scene video. Conditional flow matching provides the multimodal generative model. Although trained exclusively on synthetic data, AV-CASS is reported to achieve state-of-the-art results on synthetic benchmarks, real-world movie scenes, and audio-only evaluation settings, which indicates that audiovisual fusion is effective for audio source separation and generalizes well.
📝 Abstract
Cinematic Audio Source Separation (CASS) aims to decompose mixed film audio into speech, music, and sound effects, enabling applications like dubbing and remastering. Existing CASS approaches are audio-only, overlooking the inherent audio-visual nature of films, where sounds often align with visual cues. We present the first framework for audio-visual CASS (AV-CASS), leveraging visual context to enhance separation quality. Our method formulates CASS as a conditional generative modeling problem using conditional flow matching, enabling multimodal audio source separation. To address the lack of cinematic datasets with isolated sound tracks, we introduce a training data synthesis pipeline that pairs in-the-wild audio and video streams (e.g., facial videos for speech, scene videos for effects) and design a dedicated visual encoder for this dual-stream setup. Trained entirely on synthetic data, our model generalizes effectively to real-world cinematic content and achieves strong performance on synthetic, real-world, and audio-only CASS benchmarks. Code and demo are available at https://cass-flowmatching.github.io.
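For readers unfamiliar with conditional flow matching, the following is a minimal sketch of the general technique, not the authors' implementation. It assumes the common linear probability path x_t = (1 - t) x_0 + t x_1 with target velocity x_1 - x_0; the abstract does not say which path AV-CASS uses. All names (`cfm_training_pair`, `cfm_loss`, `euler_sample`, `oracle_velocity`) are hypothetical, and the "oracle" velocity field is a toy stand-in for a trained network that would be conditioned on the audio mixture and visual features.

```python
import random

def cfm_training_pair(x1, rng):
    """Build one flow matching training example for a clean target x1.

    Assumes the linear path x_t = (1 - t) * x0 + t * x1, whose velocity
    along the path is the constant u = x1 - x0.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]      # noise sample
    t = rng.random()                             # time drawn from [0, 1)
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    u = [b - a for a, b in zip(x0, x1)]          # regression target
    return t, x_t, u

def cfm_loss(v_pred, u_target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - u) ** 2 for p, u in zip(v_pred, u_target)) / len(u_target)

def euler_sample(velocity_fn, x0, cond, steps=10):
    """Integrate dx/dt = v(x, t, cond) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def oracle_velocity(x, t, cond):
    """Toy velocity field that reads the clean target directly from `cond`.

    Hypothetical stand-in: a real model would be a neural network that only
    sees the mixture and visual embeddings, never the clean target.
    """
    return [(c - xi) / (1.0 - t) for xi, c in zip(x, cond)]

if __name__ == "__main__":
    rng = random.Random(0)
    target = [0.5, -1.0, 2.0]                    # stand-in for a clean stem
    t, x_t, u = cfm_training_pair(target, rng)
    print("loss of a perfect prediction:", cfm_loss(u, u))
    print("sampled:", euler_sample(oracle_velocity, [0.0, 0.0, 0.0], target))
```

In training, the network's prediction at `(x_t, t, cond)` would replace `u` as the first argument to `cfm_loss`; at inference, `euler_sample` would start from noise and use the trained network in place of `oracle_velocity`.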