🤖 AI Summary
Joint audio-video generation faces three core challenges: output quality, cross-modal synchronization, and long-term temporal consistency. To address these, we propose the first framework enabling infinite-duration, high-fidelity, and strictly aligned audio-video generation. Our approach leverages a Transformer architecture with a novel rolling flow-matching mechanism to model continuous spatiotemporal distributions. We explore three purpose-built cross-modal interaction modules and find that a lightweight temporal fusion module best balances alignment accuracy and computational efficiency. Extensive experiments on multiple audio-video generation benchmarks demonstrate substantial improvements over state-of-the-art methods, with superior perceptual fidelity, precise audio-visual synchronization, and robust long-range temporal coherence. The source code and pre-trained models are publicly released.
📝 Abstract
Joint audio-video (AV) generation remains a significant challenge in generative AI, primarily due to three critical requirements: high quality of the generated samples; seamless multimodal synchronization and temporal coherence, with audio tracks matching the visual data and vice versa; and unlimited video duration. In this paper, we present $^R$-FLAV, a novel transformer-based architecture that addresses all the key challenges of AV generation. We explore three distinct cross-modal interaction modules, with our lightweight temporal fusion module emerging as the most effective and computationally efficient approach for aligning the audio and visual modalities. Our experimental results demonstrate that $^R$-FLAV outperforms existing state-of-the-art models in multimodal AV generation tasks. Our code and checkpoints are available at https://github.com/ErgastiAlex/R-FLAV.
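The abstract does not spell out the rolling flow-matching mechanism, but the standard flow-matching training target it builds on can be sketched generically: samples are drawn along a straight path between noise and data, and the model is regressed onto the path's velocity. The function name and the NumPy setup below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Generic (rectified) flow-matching training pair.

    x0: noise sample, x1: data sample, t in [0, 1].
    Returns the point on the straight interpolation path and the
    constant velocity target the model would be regressed onto.
    Illustrative only; the paper's rolling variant is not shown.
    """
    x_t = (1.0 - t) * x0 + t * x1   # point along the noise-to-data path
    v_target = x1 - x0              # velocity of the straight path
    return x_t, v_target

# toy usage: a 4-dimensional stand-in for a frame latent
rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)   # noise sample
x1 = np.ones(4)               # stand-in "data" latent
x_t, v = flow_matching_pair(x0, x1, t=0.5)
```

At t = 0.5 the interpolant is the midpoint of the noise and data latents, and the regression target is simply their difference, independent of t.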