SpongeBob: Sync-Aware Harmonious Audio-Visual Generative Editing

📅 2026-05-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video editing methods typically use decoupled pipelines without bidirectional audio-visual interaction, which often leads to audio-visual desynchronization and to contextual conflicts between generated audio and preserved content. To overcome these issues, we propose SpongeBob, the first end-to-end audio-visual joint editing framework with bidirectional cross-modal interaction. A Sync-Aware Mechanism aligns visual edits with sound events through bidirectional attention, temporal alignment, and spatial constraints, while a Context-Aware Module uses acoustic and visual context attention to prevent semantic clashes. Sync-Preserving Training and Guidance (SPTG) further improves alignment without degrading quality. We also build a scalable data pipeline with a large-scale subject-level dataset and introduce SpongeBob-Bench for systematic evaluation. Experiments show that SpongeBob improves Sync-C by 30% and Ctx-F1 by 12.5% over existing baselines.

📝 Abstract

Visual and acoustic events in the physical world are inherently coupled, yet existing video editing methods typically adopt decoupled pipelines, lacking bidirectional modality interaction. This results in two key limitations: (i) audio-visual desynchronization and (ii) contextual conflicts between generated audio and preserved content. To address these limitations, we propose SpongeBob, the first end-to-end audio-visual joint editing framework featuring bidirectional cross-modal interaction. For synchronization, a Sync-Aware Mechanism aligns visual edits with sound events via bidirectional attention, temporal alignment, and spatial constraints. For contextual consistency, a Context-Aware Module leverages acoustic and visual context attention to prevent semantic clashes. Additionally, we introduce Sync-Preserving Training and Guidance (SPTG) to enhance alignment without degrading quality. Due to the scarcity of paired data, we construct a scalable data pipeline and a large-scale subject-level dataset. We also propose SpongeBob-Bench for systematic evaluation. Experiments show SpongeBob significantly outperforms existing baselines, improving Sync-C by 30% and Ctx-F1 by 12.5%. Our project page is available at: https://hy-spongebob.github.io/.
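
To make "bidirectional cross-modal interaction" concrete, the following is a minimal illustrative sketch of generic bidirectional cross-attention between video and audio tokens. It is not the SpongeBob implementation: the abstract does not specify the architecture, so the function names (`cross_attend`, `bidirectional_step`), the token layout, the residual update, and the absence of learned projections are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Scaled dot-product attention: each query token attends over all
    tokens of the other modality. The same vectors serve as keys and
    values (an assumption: no learned projections in this sketch)."""
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        out.append([
            sum(w * kv[d] for w, kv in zip(weights, keys_values))
            for d in range(dim)
        ])
    return out


def bidirectional_step(video_tokens, audio_tokens):
    """Hypothetical bidirectional update: video attends to audio and audio
    attends to video, both computed from the same pre-update tokens, then
    each result is added back as a residual."""
    video_ctx = cross_attend(video_tokens, audio_tokens)
    audio_ctx = cross_attend(audio_tokens, video_tokens)
    new_video = [[v + c for v, c in zip(tok, ctx)]
                 for tok, ctx in zip(video_tokens, video_ctx)]
    new_audio = [[a + c for a, c in zip(tok, ctx)]
                 for tok, ctx in zip(audio_tokens, audio_ctx)]
    return new_video, new_audio


if __name__ == "__main__":
    video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three video tokens
    audio = [[0.5, -0.5]]                          # one audio token
    new_video, new_audio = bidirectional_step(video, audio)
    print(len(new_video), len(new_audio))
```

Both directions read from the pre-update tokens, so neither modality is conditioned on an already-modified version of the other. A decoupled pipeline, by contrast, would run only one of the two `cross_attend` calls.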

Problem

Research questions and friction points this paper is trying to address.

audio-visual desynchronization

contextual conflicts

cross-modal interaction

video editing

synchronization

Innovation

Methods, ideas, or system contributions that make the work stand out.

audio-visual synchronization

cross-modal interaction

context-aware generation