Foley Control: Aligning a Frozen Latent Text-to-Audio Model to Video

📅 2025-10-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses video-guided Foley sound synthesis with frozen pretrained models. It proposes a lightweight cross-modal alignment framework that freezes both the video encoder (V-JEPA2) and the audio generation backbone (Stable Audio Open DiT) and trains only a compact video cross-attention bridge between them. Pooled video tokens are injected after the backbone's existing text cross-attention layer, so text prompts set global semantics while video refines timing and local dynamics. This design preserves the frozen unimodal priors, allows the encoder or backbone to be swapped or upgraded, and avoids end-to-end retraining. On curated video-audio benchmarks, the method delivers competitive temporal and semantic alignment with far fewer trainable parameters than recent multimodal systems, while preserving prompt controllability.

📝 Abstract

Foley Control is a lightweight approach to video-guided Foley that keeps pretrained single-modality models frozen and learns only a small cross-attention bridge between them. We connect V-JEPA2 video embeddings to a frozen Stable Audio Open DiT text-to-audio (T2A) model by inserting compact video cross-attention after the model's existing text cross-attention, so prompts set global semantics while video refines timing and local dynamics. The frozen backbones retain strong marginals (video; audio given text) and the bridge learns the audio-video dependency needed for synchronization -- without retraining the audio prior. To cut memory and stabilize training, we pool video tokens before conditioning. On curated video-audio benchmarks, Foley Control delivers competitive temporal and semantic alignment with far fewer trainable parameters than recent multi-modal systems, while preserving prompt-driven controllability and production-friendly modularity (swap/upgrade encoders or the T2A backbone without end-to-end retraining). Although we focus on Video-to-Foley, the same bridge design can potentially extend to other audio modalities (e.g., speech).
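The ordering described in the abstract (text cross-attention first, video cross-attention second) can be illustrated with a toy sketch in plain Python. This is not the authors' implementation: the attention here has no learned projections, all token dimensions are assumed equal, and the names `cross_attention`, `block_with_video_bridge`, and the scalar `gate` are hypothetical, introduced only for illustration. The gate is an assumption that makes it easy to see that, when the bridge contributes nothing, the block reduces to the text-only path.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Toy single-head dot-product attention without learned projections.

    Each query vector attends over the context vectors, which act as both
    keys and values. All vectors are assumed to share one dimension.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * c[d] for w, c in zip(weights, context)) for d in range(dim)]
        )
    return outputs


def block_with_video_bridge(audio_tokens, text_tokens, video_tokens, gate):
    """Text cross-attention, then video cross-attention (hypothetical sketch).

    In the paper's setup the text path would belong to the frozen backbone
    and only the video path would be trained; `gate` is an illustrative
    scalar, not something the source describes.
    """
    # Existing text cross-attention with a residual connection.
    text_ctx = cross_attention(audio_tokens, text_tokens)
    hidden = [
        [a + t for a, t in zip(row, ctx)]
        for row, ctx in zip(audio_tokens, text_ctx)
    ]
    # Video cross-attention inserted after the text cross-attention.
    video_ctx = cross_attention(hidden, video_tokens)
    return [
        [h + gate * v for h, v in zip(row, ctx)]
        for row, ctx in zip(hidden, video_ctx)
    ]


audio = [[1.0, 0.0], [0.0, 1.0]]
text = [[0.5, 0.5]]
video = [[2.0, 0.0], [0.0, 2.0]]
text_only = block_with_video_bridge(audio, text, video, gate=0.0)
with_video = block_with_video_bridge(audio, text, video, gate=1.0)
```

With `gate=0.0` the output equals the text-conditioned hidden states, and with `gate=1.0` the video context shifts them, which mirrors the idea that prompts set the global content and video adds a refinement on top.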

Problem

Research questions and friction points this paper is trying to address.

Aligning frozen text-to-audio models with video for Foley generation

Learning cross-attention bridges between video and audio modalities

Achieving video-audio synchronization without retraining the audio prior

Innovation

Methods, ideas, or system contributions that make the work stand out.

Frozen pretrained models connected via cross-attention bridge

Video tokens pooled before conditioning to stabilize training

Bridge learns audio-video synchronization without retraining the audio prior
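As a rough illustration of pooling video tokens before conditioning, the sketch below averages consecutive token vectors over time. The page does not say which pooling operator or window size the paper uses, so mean pooling, the `window` parameter, and the function name `pool_video_tokens` are all assumptions made for this example.

```python
def pool_video_tokens(tokens, window):
    """Average consecutive groups of `window` token vectors.

    Mean pooling over time is an assumption for illustration; the final
    group may be shorter than `window` and is averaged over its own length.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    pooled = []
    for start in range(0, len(tokens), window):
        group = tokens[start:start + window]
        dim = len(group[0])
        pooled.append(
            [sum(vec[d] for vec in group) / len(group) for d in range(dim)]
        )
    return pooled


frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
pooled = pool_video_tokens(frames, window=2)
```

Five input tokens become three pooled tokens here, so the conditioning sequence that the cross-attention bridge must attend over is shorter, which is the memory saving the summary refers to.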

🔎 Similar Papers

Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound