🤖 AI Summary
This work addresses video-driven, semantically aligned audio generation: the generated audio should match the input video both in semantic content and in frame-level temporal structure. The proposed system, FoleyGRAM, uses the Gramian Representation Alignment Measure (GRAM) to align video, text, and audio encoder embeddings in a shared space, enabling fine-grained semantic control over generation. Audio is synthesized by a diffusion model conditioned on the GRAM-aligned embeddings and on waveform envelopes, which carry frame-level timing information. Evaluated on the Greatest Hits dataset, the approach improves semantic consistency and temporal synchronization over existing video-to-audio generation methods, advancing the state of the art. Aligning the multimodal encoders with GRAM grounds the generated audio in the visual semantics, while the envelope conditioning preserves precise timing correspondence with the video.
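As a rough illustration of the alignment idea, the sketch below scores a set of modality embeddings with a Gram-matrix volume: the Gram matrix collects pairwise inner products, and the square root of its determinant is the volume of the parallelotope spanned by the vectors, which shrinks as the embeddings point in more similar directions. This is only a schematic reading of how a Gramian measure can quantify video/text/audio agreement, assuming unit-normalized embeddings; it is not the authors' implementation, and all names are illustrative.

```python
import torch


def gram_volume(embeddings: torch.Tensor) -> torch.Tensor:
    """Volume of the parallelotope spanned by k modality embeddings.

    embeddings: (k, d) tensor, one row per modality (e.g. video, text, audio),
    assumed L2-normalized. A smaller volume indicates the embeddings are more
    closely aligned across modalities.
    """
    gram = embeddings @ embeddings.T  # (k, k) Gram matrix of pairwise inner products
    return torch.sqrt(torch.clamp(torch.det(gram), min=0.0))


# Illustrative usage with random stand-ins for encoder outputs.
video_emb = torch.nn.functional.normalize(torch.randn(512), dim=0)
text_emb = torch.nn.functional.normalize(torch.randn(512), dim=0)
audio_emb = torch.nn.functional.normalize(torch.randn(512), dim=0)
score = gram_volume(torch.stack([video_emb, text_emb, audio_emb]))
```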
📝 Abstract
In this work, we present FoleyGRAM, a novel approach to video-to-audio generation that emphasizes semantic conditioning through the use of aligned multimodal encoders. Building on prior advancements in video-to-audio generation, FoleyGRAM leverages the Gramian Representation Alignment Measure (GRAM) to align embeddings across video, text, and audio modalities, enabling precise semantic control over the audio generation process. The core of FoleyGRAM is a diffusion-based audio synthesis model conditioned on GRAM-aligned embeddings and waveform envelopes, ensuring both semantic richness and temporal alignment with the corresponding input video. We evaluate FoleyGRAM on the Greatest Hits dataset, a standard benchmark for video-to-audio models. Our experiments demonstrate that aligning multimodal encoders using GRAM enhances the system's ability to semantically align generated audio with video content, advancing the state of the art in video-to-audio synthesis.
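To make the two conditioning signals concrete, the following sketch shows one plausible way a diffusion denoiser could consume both a global semantic embedding (e.g. a GRAM-aligned video/text embedding) and a per-frame waveform envelope: the embedding is injected as a time-broadcast bias, while the envelope is resampled to the audio-latent rate and concatenated along the channel axis for temporal control. The module, layer sizes, and injection scheme are assumptions for illustration, not the FoleyGRAM architecture.

```python
import torch
import torch.nn as nn


class ConditionedDenoiser(nn.Module):
    """Toy denoiser conditioned on a semantic embedding and a waveform envelope.

    All shapes and layer choices are illustrative; a real diffusion denoiser
    would be far deeper and also take a noise-level/timestep input.
    """

    def __init__(self, latent_ch: int = 64, emb_dim: int = 512):
        super().__init__()
        self.emb_proj = nn.Linear(emb_dim, latent_ch)
        self.net = nn.Conv1d(latent_ch + 1, latent_ch, kernel_size=3, padding=1)

    def forward(self, noisy_latent, semantic_emb, envelope):
        # noisy_latent: (B, C, T)  semantic_emb: (B, emb_dim)  envelope: (B, T_env)
        # Resample the envelope to the latent time axis and concatenate it as a channel.
        env = nn.functional.interpolate(envelope.unsqueeze(1), size=noisy_latent.shape[-1])
        x = self.net(torch.cat([noisy_latent, env], dim=1))
        # Inject the semantic embedding as a global, time-broadcast bias.
        return x + self.emb_proj(semantic_emb).unsqueeze(-1)


# Illustrative forward pass with random tensors.
denoiser = ConditionedDenoiser()
out = denoiser(torch.randn(2, 64, 400), torch.randn(2, 512), torch.rand(2, 100))
```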