Cinematic Audio Source Separation Using Visual Cues

📅 2026-03-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes AV-CASS, the first audio-visual framework for cinematic audio source separation, which decomposes mixed film audio into speech, music, and sound effects. Existing methods are audio-only and ignore the visual cues that often accompany sounds in film. AV-CASS formulates separation as a conditional generative modeling problem, using conditional flow matching with visual context as the conditioning signal. Because cinematic datasets with isolated tracks are lacking, the model is trained on synthetic data that pairs in-the-wild audio with video (facial videos for speech, scene videos for effects), encoded by a dedicated dual-stream visual encoder. Trained exclusively on synthetic data, AV-CASS generalizes to real-world movie scenes and performs strongly on synthetic, real-world, and audio-only benchmarks.
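To make the synthetic-data idea concrete, here is a minimal sketch of how isolated stems might be combined into a training mixture. It is not the authors' pipeline: the function name `mix_stems`, the random gain range, and the use of NumPy arrays as waveforms are all assumptions made for illustration, and the pairing of each stem with a video stream is left out.

```python
import numpy as np

def mix_stems(speech, music, effects, rng, gain_db_range=(-6.0, 6.0)):
    """Mix three isolated stems into one synthetic 'film audio' mixture.

    All names and the gain range are hypothetical; the paper's actual
    synthesis pipeline is not specified in this summary.
    """
    stems = {"speech": speech, "music": music, "effects": effects}
    # Trim every stem to the shortest one so the arrays can be summed.
    n = min(len(s) for s in stems.values())
    scaled = {}
    for name, s in stems.items():
        # Draw a random per-stem gain in decibels and convert to linear scale.
        gain_db = rng.uniform(*gain_db_range)
        scaled[name] = s[:n] * 10.0 ** (gain_db / 20.0)
    # The mixture is the sample-wise sum of the scaled stems.
    mixture = sum(scaled.values())
    return mixture, scaled

rng = np.random.default_rng(0)
mixture, targets = mix_stems(rng.standard_normal(16000),
                             rng.standard_normal(16000),
                             rng.standard_normal(16000), rng)
```

The scaled stems returned alongside the mixture serve as the separation targets, which is what makes training possible without real films that ship isolated tracks.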

📝 Abstract

Cinematic Audio Source Separation (CASS) aims to decompose mixed film audio into speech, music, and sound effects, enabling applications like dubbing and remastering. Existing CASS approaches are audio-only, overlooking the inherent audio-visual nature of films, where sounds often align with visual cues. We present the first framework for audio-visual CASS (AV-CASS), leveraging visual context to enhance separation quality. Our method formulates CASS as a conditional generative modeling problem using conditional flow matching, enabling multimodal audio source separation. To address the lack of cinematic datasets with isolated sound tracks, we introduce a training data synthesis pipeline that pairs in-the-wild audio and video streams (e.g., facial videos for speech, scene videos for effects) and design a dedicated visual encoder for this dual-stream setup. Trained entirely on synthetic data, our model generalizes effectively to real-world cinematic content and achieves strong performance on synthetic, real-world, and audio-only CASS benchmarks. Code and demo are available at https://cass-flowmatching.github.io.
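For readers unfamiliar with conditional flow matching, the sketch below shows one common form of the objective: a straight-line path from Gaussian noise to the target, with a model trained to predict the constant velocity along that path. The linear path, the Gaussian prior, the Euler sampler, and every name here (`velocity_model`, `cond`, `cfm_loss`) are assumptions for illustration. The abstract does not specify which path or network AV-CASS uses; `cond` merely stands in for whatever conditioning (mixture audio, visual features) the model receives.

```python
import numpy as np

def cfm_training_pair(x1, rng):
    """Build one training example on a straight noise-to-target path."""
    x0 = rng.standard_normal(x1.shape)  # sample from the Gaussian prior
    t = rng.uniform()                   # random time in [0, 1)
    x_t = (1.0 - t) * x0 + t * x1       # point on the line from x0 to x1
    v_target = x1 - x0                  # velocity of that line (constant in t)
    return t, x_t, v_target

def cfm_loss(velocity_model, x1, cond, rng):
    """Mean squared error between predicted and target velocity."""
    t, x_t, v_target = cfm_training_pair(x1, rng)
    v_pred = velocity_model(x_t, t, cond)
    return float(np.mean((v_pred - v_target) ** 2))

def euler_sample(velocity_model, cond, shape, rng, steps=10):
    """Generate a sample by integrating the velocity field from t=0 to t=1."""
    x = rng.standard_normal(shape)
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_model(x, i * dt, cond)
    return x

# Toy usage with a placeholder model that ignores its inputs.
rng = np.random.default_rng(0)
zero_model = lambda x, t, cond: np.zeros_like(x)
loss = cfm_loss(zero_model, np.ones(8), cond=None, rng=rng)
```

In a separation setting, `x1` would be a representation of the target stems and the trained model would be sampled with the mixture and visual features supplied as `cond`.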

Problem

Research questions and friction points this paper is trying to address.

Cinematic Audio Source Separation

audio-visual

source separation

film audio

visual cues

Innovation

Methods, ideas, or system contributions that make the work stand out.

audio-visual source separation

conditional flow matching

cinematic audio