STA-V2A: Video-to-Audio Generation with Semantic and Temporal Alignment

📅 2024-09-13
🏛️ IEEE International Conference on Acoustics, Speech, and Signal Processing
📈 Citations: 4
Influential: 0
🤖 AI Summary
To address the absence of audio, semantic inconsistency, and temporal misalignment in videos produced by text-to-video generation, this paper proposes a text-guided video-to-audio generation framework. Methodologically, the authors: (1) design a latent diffusion model (LDM)-based architecture that integrates a video temporal encoder, a semantic attention pooling module, and text cross-attention; (2) introduce an onset-aware pretraining objective to strengthen beat-level temporal modeling; and (3) propose a text-prior initialization strategy to improve cross-modal consistency. Contributions include (i) Audio-Audio Align, a new evaluation metric for temporal alignment, and (ii) state-of-the-art performance in both objective and subjective evaluations, with significant improvements in audio fidelity, semantic accuracy, and temporal alignment precision. Ablation studies validate the efficacy of each component.
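The summary names Audio-Audio Align but does not define it. As an illustration of onset-based temporal-alignment scoring in the same spirit, here is a minimal sketch; the function name, tolerance window, and F1 formulation are assumptions, not the paper's actual metric definition:

```python
import numpy as np

def onset_alignment_f1(gen_onsets, ref_onsets, tol=0.1):
    """Toy temporal-alignment score: an onset in the generated audio
    counts as a hit if a reference onset lies within `tol` seconds.
    Illustrative stand-in, not the paper's Audio-Audio Align metric."""
    gen = np.asarray(gen_onsets, dtype=float)
    ref = np.asarray(ref_onsets, dtype=float)
    if len(gen) == 0 or len(ref) == 0:
        return 0.0
    precision = sum(np.min(np.abs(ref - t)) <= tol for t in gen) / len(gen)
    recall = sum(np.min(np.abs(gen - t)) <= tol for t in ref) / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With perfectly matching onset lists the score is 1.0, and it degrades as generated onsets drift outside the tolerance window.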

📝 Abstract
Visual and auditory perception are two crucial ways humans experience the world. Text-to-video generation has made remarkable progress over the past year, but the absence of harmonious audio in generated video limits its broader applications. In this paper, we propose Semantic and Temporal Aligned Video-to-Audio (STA-V2A), an approach that enhances audio generation from videos by extracting both local temporal and global semantic video features and combining these refined video features with text as cross-modal guidance. To address the issue of information redundancy in videos, we propose an onset prediction pretext task for local temporal feature extraction and an attentive pooling module for global semantic feature extraction. To supplement the insufficient semantic information in videos, we propose a Latent Diffusion Model with Text-to-Audio priors initialization and cross-modal guidance. We also introduce Audio-Audio Align, a new metric to assess audio-temporal alignment. Subjective and objective metrics demonstrate that our method surpasses existing Video-to-Audio models in generating audio with better quality, semantic consistency, and temporal alignment. The ablation experiment validated the effectiveness of each module. Audio samples are available at https://y-ren16.github.io/STAV2A.
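The attentive pooling module mentioned in the abstract collapses redundant per-frame video features into a single global semantic vector. A minimal sketch, assuming a single learned scoring vector `w` standing in for the paper's actual attention parameters:

```python
import numpy as np

def attentive_pooling(frame_feats, w):
    """Collapse per-frame video features (T, D) into one global
    semantic vector (D,) via softmax attention over frames.
    `w` (D,) is a hypothetical learned scoring vector."""
    scores = frame_feats @ w                      # (T,) per-frame relevance
    scores = scores - scores.max()                # numerical stability
    attn = np.exp(scores) / np.exp(scores).sum()  # softmax over frames
    return attn @ frame_feats                     # attention-weighted sum
```

With a zero scoring vector this reduces to plain mean pooling; a learned `w` lets the model downweight redundant frames.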
Problem

Research questions and friction points this paper is trying to address.

Enhances audio generation from videos using semantic and temporal alignment
Addresses video information redundancy with onset prediction and attentive pooling
Improves semantic consistency with cross-modal guidance and text-to-audio priors
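The onset-prediction pretext task above can be read as per-frame binary classification: the video temporal encoder predicts whether an audio onset occurs at each frame, with labels derived from the paired audio track. A minimal sketch of such a loss, assuming binary cross-entropy (the paper's exact objective may differ):

```python
import numpy as np

def onset_pretext_loss(frame_logits, onset_labels):
    """Binary cross-entropy for a hypothetical onset-prediction pretext
    task: per-frame logits from the video encoder vs. 0/1 onset labels."""
    p = 1.0 / (1.0 + np.exp(-np.asarray(frame_logits, dtype=float)))
    y = np.asarray(onset_labels, dtype=float)
    eps = 1e-9  # guard against log(0)
    return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))
```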
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extracts local temporal and global semantic video features
Uses cross-modal guidance combining video features with text
Employs Latent Diffusion Model with Text-to-Audio priors
Yong Ren
Institute of Automation, Chinese Academy of Sciences
Speech Codec · Text-to-speech · Video-to-audio · MLLM · Continual Learning
Chenxing Li
Tencent AI Lab, Beijing, China
Manjie Xu
Peking University
Cognitive Reasoning
Weihan Liang
Beijing Institute of Technology
Yu Gu
Tencent AI Lab, Beijing, China
Rilin Chen
Tencent AI Lab, Beijing, China
Dong Yu
Tencent AI Lab, Seattle, USA