🤖 AI Summary
Joint audio-video generation faces three core challenges: output quality, cross-modal synchronization, and long-term temporal consistency. To address these, we propose the first framework enabling infinite-duration, high-fidelity, and strictly aligned audio-video generation. Our approach leverages a Transformer architecture with a novel rolling flow-matching mechanism to model continuous spatiotemporal distributions. We explore three purpose-built cross-modal interaction modules and find that a lightweight temporal fusion module best balances alignment accuracy and computational efficiency. Extensive experiments on multiple audio-video generation benchmarks demonstrate substantial improvements over state-of-the-art methods, with superior perceptual fidelity, precise audio-visual synchronization, and robust long-range temporal coherence. The source code and pre-trained models are publicly released.
📝 Abstract
Joint audio-video (AV) generation remains a significant challenge in generative AI, primarily due to three critical requirements: high quality of the generated samples; seamless multimodal synchronization and temporal coherence, with audio tracks matching the visual data and vice versa; and unlimited video duration. In this paper, we present $^R$-FLAV, a novel transformer-based architecture that addresses all the key challenges of AV generation. We explore three distinct cross-modal interaction modules, with our lightweight temporal fusion module emerging as the most effective and computationally efficient approach for aligning the audio and visual modalities. Our experimental results demonstrate that $^R$-FLAV outperforms existing state-of-the-art models in multimodal AV generation tasks. Our code and checkpoints are available at https://github.com/ErgastiAlex/R-FLAV.
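The abstract does not spell out the rolling flow-matching mechanism, but the standard flow-matching training target it builds on can be sketched generically: samples are drawn along a straight path between noise and data, and the model is regressed onto the path's velocity. The function name and the NumPy setup below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Generic (rectified) flow-matching training pair.

    x0: noise sample, x1: data sample, t in [0, 1].
    Returns the point on the straight interpolation path and the
    constant velocity target the model would be regressed onto.
    Illustrative only; the paper's rolling variant is not shown.
    """
    x_t = (1.0 - t) * x0 + t * x1   # point along the noise-to-data path
    v_target = x1 - x0              # velocity of the straight path
    return x_t, v_target

# toy usage: a 4-dimensional stand-in for a frame latent
rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)   # noise sample
x1 = np.ones(4)               # stand-in "data" latent
x_t, v = flow_matching_pair(x0, x1, t=0.5)
```

At t = 0.5 the interpolant is the midpoint of the noise and data latents, and the regression target is simply their difference, independent of t.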