🤖 AI Summary
This work addresses video-driven, semantically aligned audio generation: the generated audio should match the input video both in semantic content and in frame-level temporal structure. The proposed system, FoleyGRAM, uses the Gramian Representation Alignment Measure (GRAM) to align video, text, and audio encoder embeddings in a shared space, enabling fine-grained semantic control over generation. Audio is synthesized by a diffusion model conditioned on the GRAM-aligned embeddings and on waveform envelopes, which carry frame-level timing information. Evaluated on the Greatest Hits dataset, the approach improves semantic consistency and temporal synchronization over existing video-to-audio generation methods, advancing the state of the art. Aligning the multimodal encoders with GRAM grounds the generated audio in the visual semantics, while the envelope conditioning preserves precise timing correspondence with the video.
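As a rough illustration of the alignment idea, the sketch below scores a set of modality embeddings with a Gram-matrix volume: the Gram matrix collects pairwise inner products, and the square root of its determinant is the volume of the parallelotope spanned by the vectors, which shrinks as the embeddings point in more similar directions. This is only a schematic reading of how a Gramian measure can quantify video/text/audio agreement, assuming unit-normalized embeddings; it is not the authors' implementation, and all names are illustrative.

```python
import torch


def gram_volume(embeddings: torch.Tensor) -> torch.Tensor:
    """Volume of the parallelotope spanned by k modality embeddings.

    embeddings: (k, d) tensor, one row per modality (e.g. video, text, audio),
    assumed L2-normalized. A smaller volume indicates the embeddings are more
    closely aligned across modalities.
    """
    gram = embeddings @ embeddings.T  # (k, k) Gram matrix of pairwise inner products
    return torch.sqrt(torch.clamp(torch.det(gram), min=0.0))


# Illustrative usage with random stand-ins for encoder outputs.
video_emb = torch.nn.functional.normalize(torch.randn(512), dim=0)
text_emb = torch.nn.functional.normalize(torch.randn(512), dim=0)
audio_emb = torch.nn.functional.normalize(torch.randn(512), dim=0)
score = gram_volume(torch.stack([video_emb, text_emb, audio_emb]))
```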
📝 Abstract
In this work, we present FoleyGRAM, a novel approach to video-to-audio generation that emphasizes semantic conditioning through the use of aligned multimodal encoders. Building on prior advancements in video-to-audio generation, FoleyGRAM leverages the Gramian Representation Alignment Measure (GRAM) to align embeddings across video, text, and audio modalities, enabling precise semantic control over the audio generation process. The core of FoleyGRAM is a diffusion-based audio synthesis model conditioned on GRAM-aligned embeddings and waveform envelopes, ensuring both semantic richness and temporal alignment with the corresponding input video. We evaluate FoleyGRAM on the Greatest Hits dataset, a standard benchmark for video-to-audio models. Our experiments demonstrate that aligning multimodal encoders using GRAM enhances the system's ability to semantically align generated audio with video content, advancing the state of the art in video-to-audio synthesis.
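To make the two conditioning signals concrete, the following sketch shows one plausible way a diffusion denoiser could consume both a global semantic embedding (e.g. a GRAM-aligned video/text embedding) and a per-frame waveform envelope: the embedding is injected as a time-broadcast bias, while the envelope is resampled to the audio-latent rate and concatenated along the channel axis for temporal control. The module, layer sizes, and injection scheme are assumptions for illustration, not the FoleyGRAM architecture.

```python
import torch
import torch.nn as nn


class ConditionedDenoiser(nn.Module):
    """Toy denoiser conditioned on a semantic embedding and a waveform envelope.

    All shapes and layer choices are illustrative; a real diffusion denoiser
    would be far deeper and also take a noise-level/timestep input.
    """

    def __init__(self, latent_ch: int = 64, emb_dim: int = 512):
        super().__init__()
        self.emb_proj = nn.Linear(emb_dim, latent_ch)
        self.net = nn.Conv1d(latent_ch + 1, latent_ch, kernel_size=3, padding=1)

    def forward(self, noisy_latent, semantic_emb, envelope):
        # noisy_latent: (B, C, T)  semantic_emb: (B, emb_dim)  envelope: (B, T_env)
        # Resample the envelope to the latent time axis and concatenate it as a channel.
        env = nn.functional.interpolate(envelope.unsqueeze(1), size=noisy_latent.shape[-1])
        x = self.net(torch.cat([noisy_latent, env], dim=1))
        # Inject the semantic embedding as a global, time-broadcast bias.
        return x + self.emb_proj(semantic_emb).unsqueeze(-1)


# Illustrative forward pass with random tensors.
denoiser = ConditionedDenoiser()
out = denoiser(torch.randn(2, 64, 400), torch.randn(2, 512), torch.rand(2, 100))
```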