TA-V2A: Textually Assisted Video-to-Audio Generation

📅 2025-03-12
🏛️ IEEE International Conference on Acoustics, Speech, and Signal Processing
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Existing video-to-audio (V2A) generation methods rely on frame-level visual features, limiting their ability to model temporal semantics and achieve precise cross-modal alignment—resulting in semantic degradation and audio-video misalignment. To address this, we propose a language-model-driven multimodal diffusion framework: (1) a large language model (LLM) parses video semantics and generates structured textual prompts; (2) a text-modulated cross-modal feature fusion module enables fine-grained alignment of video, audio, and text representations in the latent space; and (3) an adaptive temporal modeling mechanism enhances temporal coherence of the generated audio. Evaluated on multiple benchmarks, our method achieves significant improvements in audio fidelity (+12.6% MOS) and semantic consistency (+28.3% CLAP Score), accelerates inference by 37%, and supports text-controllable and personalized audiovisual generation.
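
The summary's three-stage pipeline can be sketched in code. This is a minimal illustration assuming hypothetical module names (`llm`, `fusion_module`, `temporal_module`, `diffusion_model`); it is not the authors' released API.

```python
import torch

def generate_audio(video_frames: torch.Tensor,
                   llm,               # (1) LLM that captions the video
                   fusion_module,     # (2) text-modulated cross-modal fusion
                   temporal_module,   # (3) adaptive temporal modeling
                   diffusion_model,
                   user_prompt: str | None = None) -> torch.Tensor:
    """Hypothetical sketch of the TA-V2A pipeline described above."""
    # (1) Parse video semantics into a structured textual prompt.
    text_prompt = llm.describe(video_frames)
    if user_prompt is not None:
        # Text-guided interface: the user may refine or override the prompt.
        text_prompt = f"{text_prompt}. {user_prompt}"

    # (2) Fuse video and text representations in a shared latent space,
    # with the text embedding modulating the visual features.
    latent_cond = fusion_module(video_frames, text_prompt)

    # (3) Enforce temporal coherence of the conditioning sequence.
    latent_cond = temporal_module(latent_cond)

    # Sample audio with the conditional diffusion model.
    return diffusion_model.sample(condition=latent_cond)
```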

📝 Abstract
As artificial intelligence-generated content (AIGC) continues to evolve, video-to-audio (V2A) generation has emerged as a key area with promising applications in multimedia editing, augmented reality, and automated content creation. While Transformer and Diffusion models have advanced audio generation, a significant challenge persists in extracting precise semantic information from videos, as current models often lose sequential context by relying solely on frame-based features. To address this, we present TA-V2A, a method that integrates language, audio, and video features to improve semantic representation in latent space. By incorporating large language models for enhanced video comprehension, our approach leverages text guidance to enrich semantic expression. Our diffusion model-based system utilizes automated text modulation to enhance inference quality and efficiency, providing personalized control through text-guided interfaces. This integration enhances semantic expression while ensuring temporal alignment, leading to more accurate and coherent video-to-audio generation.
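
Text guidance at diffusion inference time is commonly implemented with classifier-free guidance; the sketch below shows that standard mechanism for concreteness. Whether TA-V2A uses this exact scheme, and the `model(x_t, t, cond)` signature, are assumptions rather than claims from the paper.

```python
import torch

def guided_noise_estimate(model, x_t: torch.Tensor, t: torch.Tensor,
                          text_cond, guidance_scale: float = 3.0) -> torch.Tensor:
    # Classifier-free guidance: blend conditional and unconditional
    # noise estimates, pushing sampling toward the text condition.
    eps_cond = model(x_t, t, text_cond)   # text-conditioned estimate
    eps_uncond = model(x_t, t, None)      # unconditional estimate
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

A larger `guidance_scale` trades sample diversity for tighter adherence to the text prompt.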
Problem

Research questions and friction points this paper is trying to address.

Weak semantic representation in current video-to-audio generation.
Loss of sequential context when models rely solely on frame-based features.
Lack of text-guided control over inference quality and personalization.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates language, audio, and video features
Uses large language models for video comprehension
Employs a diffusion model with text-guided interfaces (see the sketch below)
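
One plausible reading of the text-modulated cross-modal feature fusion is cross-attention from video tokens to text tokens followed by FiLM-style gating. The layer below is a sketch under that assumption; the paper does not publish this exact design, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class TextModulatedFusion(nn.Module):
    """Hypothetical text-modulated fusion layer: video tokens attend to
    text tokens, and the attended text signal scales/shifts the visual
    features (FiLM-style modulation) in a shared latent space."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.to_scale = nn.Linear(dim, dim)  # per-token FiLM scale from text
        self.to_shift = nn.Linear(dim, dim)  # per-token FiLM shift from text
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_tokens: torch.Tensor,
                text_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, T_video, dim); text_tokens: (B, T_text, dim)
        attended, _ = self.cross_attn(query=video_tokens,
                                      key=text_tokens,
                                      value=text_tokens)
        # Text modulates the video features rather than replacing them.
        return self.norm(video_tokens * (1 + self.to_scale(attended))
                         + self.to_shift(attended))
```

For example, `TextModulatedFusion(dim=512)(video, text)` returns video tokens reweighted by the text signal, which can then serve as the diffusion model's conditioning sequence.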
🔎 Similar Papers
2024-09-13 · IEEE International Conference on Acoustics, Speech, and Signal Processing · Citations: 4

👥 Authors
Yuhuan You, State Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University, Beijing, China
Xihong Wu, Peking University (Machine learning · Speech signal processing · Artificial intelligence)
T. Qu, State Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University, Beijing, China