Conditional Flow Matching for Visually-Guided Acoustic Highlighting

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a generative audio-visual remixing approach based on conditional flow matching (CFM) to address the misalignment between auditory and visual focus in audiovisual content. By regressing a conditional vector field with cross-modal conditioning, the method integrates visual guidance signals to selectively enhance target sound sources. In addition, a rollout loss encourages self-correcting trajectories, mitigating error accumulation during the iterative flow generation process. Experimental results show that the method significantly outperforms existing discriminative approaches in both quantitative metrics and qualitative assessments, supporting generative modeling as the better fit for this audio-visual coordination task.

📝 Abstract
Visually-guided acoustic highlighting seeks to rebalance audio in alignment with the accompanying video, creating a coherent audio-visual experience. While visual saliency and enhancement have been widely studied, acoustic highlighting remains underexplored, often leading to misalignment between visual and auditory focus. Existing approaches use discriminative models, which struggle with the inherent ambiguity in audio remixing, where no natural one-to-one mapping exists between poorly-balanced and well-balanced audio mixes. To address this limitation, we reframe this task as a generative problem and introduce a Conditional Flow Matching (CFM) framework. A key challenge in iterative flow-based generation is that early prediction errors -- in selecting the correct source to enhance -- compound over steps and push trajectories off-manifold. To address this, we introduce a rollout loss that penalizes drift at the final step, encouraging self-correcting trajectories and stabilizing long-range flow integration. We further propose a conditioning module that fuses audio and visual cues before vector field regression, enabling explicit cross-modal source selection. Extensive quantitative and qualitative evaluations show that our method consistently surpasses the previous state-of-the-art discriminative approach, establishing that visually-guided audio remixing is best addressed through generative modeling.
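The abstract describes two training signals: a standard conditional flow matching objective (regress a conditioned vector field onto the straight-line velocity between a poorly-balanced and a well-balanced mix) and a rollout loss that penalizes endpoint drift after integrating the learned ODE. The paper's actual architecture is not shown here, so the following is only a minimal NumPy sketch of those two losses; `velocity_field`, the linear weight matrix `W`, the Euler integrator, and the 0.1 rollout weight are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def velocity_field(x, t, cond, W):
    # Hypothetical tiny linear "network": predicts a velocity from the
    # current state x, the time t, and a fused audio-visual conditioning
    # vector (standing in for the paper's conditioning module).
    inp = np.concatenate([x, [t], cond])
    return W @ inp

def cfm_loss(W, x0, x1, cond):
    # Conditional flow matching: sample t, interpolate along the straight
    # path, and regress the predicted velocity onto the target x1 - x0.
    t = rng.uniform()
    xt = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    v_pred = velocity_field(xt, t, cond, W)
    return float(np.mean((v_pred - v_target) ** 2))

def rollout_loss(W, x0, x1, cond, steps=8):
    # Integrate the learned ODE from x0 to t=1 with Euler steps and
    # penalize drift of the endpoint from the target mix x1, so early
    # prediction errors are corrected rather than compounded.
    x, dt = x0.copy(), 1.0 / steps
    for k in range(steps):
        x = x + dt * velocity_field(x, k * dt, cond, W)
    return float(np.mean((x - x1) ** 2))

d = 4
x0 = rng.normal(size=d)         # poorly-balanced audio mix (latent)
x1 = rng.normal(size=d)         # well-balanced target mix (latent)
cond = rng.normal(size=3)       # fused audio-visual conditioning
W = np.zeros((d, d + 1 + 3))    # untrained field: predicts zero velocity

total = cfm_loss(W, x0, x1, cond) + 0.1 * rollout_loss(W, x0, x1, cond)
print(total)
```

With the zero-initialized field, both losses reduce to the mean squared gap between the two mixes, which illustrates why the rollout term is complementary rather than redundant only once the field is trained: it scores the integrated trajectory, not a single step.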
Problem

Research questions and friction points this paper is trying to address.

visually-guided acoustic highlighting
audio-visual alignment
audio remixing
acoustic highlighting
cross-modal focus
Innovation

Methods, ideas, or system contributions that make the work stand out.

Conditional Flow Matching
visually-guided audio remixing
generative modeling
rollout loss
cross-modal source selection