Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models

📅 2026-02-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video-to-audio generation methods struggle to generalize to long-duration audio synthesis when trained solely on short clips, primarily because of limited data for multimodal alignment and the mismatch between text descriptions and frame-level video information. To address this, this work proposes MMHNet, a multimodal hierarchical network that combines a hierarchical architecture with non-causal Mamba to demonstrate "short training, long inference" length generalization. The approach improves long-sequence audio generation, outperforms prior methods on long-video-to-audio benchmarks, and synthesizes audio longer than five minutes.

📝 Abstract

Scaling multimodal alignment between video and audio is challenging, particularly due to limited data and the mismatch between text descriptions and frame-level video information. In this work, we tackle the scaling challenge in multimodal-to-audio generation, examining whether models trained on short instances can generalize to longer ones at test time. To this end, we present a multimodal hierarchical network, MMHNet, an enhanced extension of state-of-the-art video-to-audio models. Our approach integrates a hierarchical method and non-causal Mamba to support long-form audio generation, and it significantly improves long audio generation, extending it to durations of more than 5 minutes. We also show that training on short clips and testing on long ones is possible in video-to-audio generation, without training on the longer durations. In our experiments, the proposed method achieves remarkable results on long-video-to-audio benchmarks, outperforming prior work on video-to-audio tasks. Moreover, we showcase the model's ability to generate more than 5 minutes of audio, whereas prior video-to-audio methods fall short at long durations.
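The abstract does not spell out how the non-causal Mamba layer is built, so the following is only a toy illustration of the general idea, not MMHNet's implementation. It assumes that "non-causal" means combining a forward and a backward recurrent scan, and it replaces Mamba's input-dependent (selective) parameters with a single fixed `decay`. The function names `causal_scan` and `non_causal_scan` are hypothetical. The sketch shows why a recurrent scan has no built-in length limit: the same function runs unchanged on a short or a long sequence.

```python
def causal_scan(xs, decay=0.9):
    """Simplified linear recurrence: h[t] = decay * h[t-1] + (1 - decay) * x[t].

    Assumption: a fixed scalar decay stands in for Mamba's learned,
    input-dependent state-space parameters.
    """
    h, out = 0.0, []
    for x in xs:
        h = decay * h + (1.0 - decay) * x
        out.append(h)
    return out


def non_causal_scan(xs, decay=0.9):
    """Average a forward scan and a backward scan (assumed bidirectional design).

    Each output position then depends on both past and future inputs.
    """
    forward = causal_scan(xs, decay)
    backward = causal_scan(xs[::-1], decay)[::-1]
    return [0.5 * (f + b) for f, b in zip(forward, backward)]


# The same function handles a short "training-length" sequence and a
# 100x longer "test-length" sequence without any change.
short_out = non_causal_scan([1.0] * 8)
long_out = non_causal_scan([1.0] * 800)
```

In this toy version, the first output of `non_causal_scan([0.0, 0.0, 1.0])` is nonzero because the backward scan carries information from the last input, whereas the causal scan alone would give zero at that position.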

Problem

Research questions and friction points this paper is trying to address.

length generalization

video-to-audio generation

long-form audio

multimodal alignment

scaling challenge

Innovation

Methods, ideas, or system contributions that make the work stand out.

length generalization

video-to-audio generation

hierarchical multimodal network