TARO: Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning for Synchronized Video-to-Audio Synthesis

📅 2025-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses insufficient audio fidelity and temporal inconsistency in video-to-audio synthesis, proposing a synchronized generation method tailored to dynamic visual events. The approach is built on a flow-based Transformer architecture and introduces two key innovations: (1) timestep-adaptive representation alignment (TRA), which ensures smooth latent-space evolution under the noise schedule; and (2) onset-aware conditioning (OAC), which enables precise modeling of audio onset timing driven by visual input. The method integrates noise-schedule-aware alignment, onset-driven visual encoding, and continuous probabilistic modeling. Evaluated on the VGGSound and Landscape datasets, it reduces Fréchet Distance (FD) by 53% and Fréchet Audio Distance (FAD) by 29%, while achieving 97.19% audio-visual temporal alignment accuracy, demonstrating substantial improvements in audio realism and cross-modal temporal coherence.

📝 Abstract
This paper introduces Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning (TARO), a novel framework for high-fidelity and temporally coherent video-to-audio synthesis. Built upon flow-based transformers, which offer stable training and continuous transformations for enhanced synchronization and audio quality, TARO introduces two key innovations: (1) Timestep-Adaptive Representation Alignment (TRA), which dynamically aligns latent representations by adjusting alignment strength based on the noise schedule, ensuring smooth evolution and improved fidelity, and (2) Onset-Aware Conditioning (OAC), which integrates onset cues, sharp event-driven markers of audio-relevant visual moments, to enhance synchronization with dynamic visual events. Extensive experiments on the VGGSound and Landscape datasets demonstrate that TARO outperforms prior methods, achieving a relative 53% reduction in Fréchet Distance (FD), a 29% reduction in Fréchet Audio Distance (FAD), and 97.19% Alignment Accuracy, highlighting its superior audio quality and synchronization precision.
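The TRA idea described above — scaling alignment strength by how noisy the latent currently is — can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the linear weight schedule, the cosine-distance alignment term, and the function names are all assumptions introduced here for clarity.

```python
import math

def tra_weight(t: float) -> float:
    """Hypothetical timestep-adaptive weight (not the paper's exact schedule).

    At high-noise timesteps (t near 1) the latent carries little signal, so
    the alignment term is down-weighted; near t = 0 the latent is close to
    clean audio, so alignment to the target representation is emphasized.
    """
    return 1.0 - t  # simple linear schedule, chosen only for illustration

def alignment_loss(latent, target, t):
    """Cosine-distance alignment between a model latent and a reference
    representation, scaled by the timestep-adaptive weight."""
    dot = sum(a * b for a, b in zip(latent, target))
    na = math.sqrt(sum(a * a for a in latent))
    nb = math.sqrt(sum(b * b for b in target))
    cosine = dot / (na * nb)
    return tra_weight(t) * (1.0 - cosine)
```

For example, perfectly aligned vectors give zero loss at any timestep, while a misaligned pair is penalized most heavily at low-noise timesteps.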
Problem

Research questions and friction points this paper is trying to address.

Achieving high-fidelity synchronized video-to-audio synthesis
Dynamically aligning latent representations for improved fidelity
Enhancing synchronization with onset-aware visual event cues
Innovation

Methods, ideas, or system contributions that make the work stand out.

Flow-based transformers for stable training
Timestep-Adaptive Representation Alignment (TRA)
Onset-Aware Conditioning (OAC) for synchronization
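The onset cues behind OAC can be pictured as a sparse per-frame indicator of sudden change that is fed to the generator as conditioning. The sketch below is a loose stand-in under stated assumptions: TARO derives cues from visual features, whereas here a generic per-frame scalar signal (`frame_energy`) and a fixed jump `threshold` are hypothetical simplifications.

```python
def onset_envelope(frame_energy, threshold=0.5):
    """Hypothetical onset cue extraction: mark frames whose signal jumps
    above `threshold` relative to the previous frame. Returns a 0/1
    indicator per frame that could serve as event-driven conditioning."""
    cues = [0.0]  # first frame has no predecessor, so no onset by convention
    for prev, cur in zip(frame_energy, frame_energy[1:]):
        cues.append(1.0 if (cur - prev) > threshold else 0.0)
    return cues
```

A sharp event (e.g. an impact visible on screen) shows up as an isolated 1.0 in the envelope, giving the model an explicit marker of when a sound should begin.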