🤖 AI Summary
This work addresses cross-modal video-to-audio generation with emphasis on semantic consistency and frame-level temporal alignment. To this end, we propose an alignment-aware framework featuring: (i) a lightweight vision encoder for efficient video representation extraction; (ii) learnable auxiliary embeddings that explicitly model audio-video correspondence; and (iii) multi-scale temporal data augmentation coupled with end-to-end joint training to enforce temporal coherence. A key finding concerns the implicit alignment mechanism: ablations reveal the critical role of auxiliary embeddings and augmentation strategies in achieving precise synchronization. We further introduce a comprehensive evaluation paradigm designed specifically for audio-video alignment. Experiments demonstrate state-of-the-art performance in both audio fidelity (measured by STFT-L1 and PESQ) and frame-level synchronization accuracy (quantified by SyncScore), establishing a new benchmark for realistic audio-visual generation.
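For illustration, the snippet below is a minimal sketch of how such an alignment-aware pipeline could be wired in PyTorch. The class name, dimensions, and the Transformer-based generator are illustrative placeholders rather than the paper's actual implementation; the point is to show where the learnable auxiliary embeddings enter, namely added to the per-frame visual features before audio generation.

```python
import torch
import torch.nn as nn

class AlignmentAwareV2A(nn.Module):
    """Skeleton video-to-audio model: a lightweight visual encoder plus
    learnable auxiliary embeddings feeding a sequence generator (illustrative)."""

    def __init__(self, frame_dim=512, hidden_dim=256, n_frames=32, audio_dim=128):
        super().__init__()
        # Lightweight visual encoder: a simple per-frame projection (placeholder).
        self.visual_encoder = nn.Sequential(nn.Linear(frame_dim, hidden_dim), nn.GELU())
        # Learnable auxiliary embeddings, one per frame position,
        # intended to carry audio-video correspondence cues.
        self.aux_embed = nn.Parameter(torch.randn(1, n_frames, hidden_dim) * 0.02)
        # Placeholder generator mapping aligned visual tokens to audio features.
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=4, batch_first=True)
        self.generator = nn.TransformerEncoder(layer, num_layers=2)
        self.to_audio = nn.Linear(hidden_dim, audio_dim)

    def forward(self, frames):
        # frames: (batch, n_frames, frame_dim) pre-extracted per-frame features
        v = self.visual_encoder(frames)
        v = v + self.aux_embed          # inject auxiliary alignment embeddings
        h = self.generator(v)           # temporal mixing across frame positions
        return self.to_audio(h)         # (batch, n_frames, audio_dim) audio features
```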
📝 Abstract
Generating audio that is semantically and temporally aligned with a given video has become a focal point for researchers, particularly following the remarkable breakthroughs in text-to-video generation. In this work, we aim to offer insights into the video-to-audio generation paradigm, focusing on three crucial aspects: vision encoders, auxiliary embeddings, and data augmentation techniques. Beginning with a foundational model built on a simple yet surprisingly effective intuition, we explore various vision encoders and auxiliary embeddings through ablation studies. Employing a comprehensive evaluation pipeline that emphasizes generation quality and video-audio synchronization, we demonstrate that our model achieves state-of-the-art video-to-audio generation capabilities. Furthermore, we provide critical insights into how different data augmentation methods affect the framework's overall generation capacity, and we showcase ways to advance the challenge of generating audio that is synchronized with video from both semantic and temporal perspectives. We hope these insights will serve as a stepping stone toward developing more realistic and accurate audio-visual generation models.
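To make the data-augmentation aspect concrete, the sketch below shows one plausible reading of temporal augmentation for paired video and audio: a window is cropped at a randomly chosen temporal scale and applied jointly to both streams so the pair stays aligned. The function name, the scale set, and the frame-rate / sample-rate arguments are assumptions for illustration, not the paper's specific recipe.

```python
import torch

def joint_temporal_crop(frames, audio, scales=(1.0, 0.75, 0.5), fps=8, sr=16000):
    """Crop a random temporal window, at a randomly chosen scale, jointly from
    the video frames and the matching audio span so the pair stays aligned.
    frames: (n_frames, C, H, W) sampled at `fps`; audio: (n_samples,) at `sr`."""
    n_frames = frames.shape[0]
    scale = scales[torch.randint(len(scales), (1,)).item()]
    win = max(1, int(n_frames * scale))
    start_f = torch.randint(0, n_frames - win + 1, (1,)).item()
    # Map the chosen frame window onto the corresponding audio samples.
    spf = sr // fps                      # audio samples per video frame
    start_s, end_s = start_f * spf, (start_f + win) * spf
    return frames[start_f:start_f + win], audio[start_s:end_s]
```

Drawing crops at several values in `scales` is what would give the augmentation its multi-scale character; since the same window indices are applied to both modalities, frame-level alignment is preserved by construction.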