V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation

📅 2026-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of precisely aligning generated music with video events, a capability that existing text-to-music models lack due to limited fine-grained temporal control. The authors propose an unpaired video-to-music generation method that aligns the two modalities through their intrinsic event dynamics rather than through paired data or joint training. Specifically, pre-trained encoders measure intra-modal similarity to construct event curves for both video and music; a text-to-music model is then fine-tuned to follow these curves, enabling video-driven music synthesis at inference time. Experiments show substantial improvements over paired-data baselines across multiple benchmarks: audio quality improves by 5–21%, semantic alignment by 13–15%, temporal synchronization by 21–52%, and dance beat alignment by 28%.

📝 Abstract
Generating music that temporally aligns with video events is challenging for existing text-to-music models, which lack fine-grained temporal control. We introduce V2M-Zero, a zero-pair video-to-music generation approach that outputs time-aligned music for a given video. Our method is motivated by a key observation: temporal synchronization requires matching when and how much change occurs, not what changes. While musical and visual events differ semantically, they exhibit shared temporal structure that can be captured independently within each modality. We capture this structure through event curves computed from intra-modal similarity using pretrained music and video encoders. By measuring temporal change within each modality independently, these curves provide comparable representations across modalities. This enables a simple training strategy: fine-tune a text-to-music model on music-event curves, then substitute video-event curves at inference, without cross-modal training or paired data. Across OES-Pub, MovieGenBench-Music, and AIST++, V2M-Zero achieves substantial gains over paired-data baselines: 5-21% higher audio quality, 13-15% better semantic alignment, 21-52% improved temporal synchronization, and 28% higher beat alignment on dance videos. A large crowd-sourced subjective listening test yields similar results. Overall, our results validate that temporal alignment through within-modality features, rather than paired cross-modal supervision, is effective for video-to-music generation. Results are available at https://genjib.github.io/v2m_zero/
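The abstract describes event curves as measures of temporal change computed from intra-modal similarity of pretrained-encoder features. The paper's exact formula is not given here; the sketch below is a minimal, hypothetical version under the assumption that an event curve is the frame-to-frame dissimilarity of per-timestep embeddings (the function name and shape conventions are illustrative, not the authors' implementation):

```python
import numpy as np

def event_curve(embeddings: np.ndarray) -> np.ndarray:
    """Hypothetical event curve from per-timestep embeddings.

    embeddings: (T, D) array of features from a pretrained encoder
    (video frames or music chunks). Returns a length T-1 curve where
    large values mark moments of large temporal change, regardless of
    what semantically changed -- matching the paper's "when and how
    much change occurs, not what changes" intuition.
    """
    # L2-normalize each timestep's embedding.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-8, None)
    # Cosine similarity between consecutive timesteps.
    sim = np.sum(unit[:-1] * unit[1:], axis=1)
    # Dissimilarity = "how much changed" at each step.
    return 1.0 - sim

# Toy example: a single abrupt change between steps 1 and 2
# produces a single spike in the curve.
emb = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
curve = event_curve(emb)  # -> [0.0, 1.0, 0.0]
```

Because the curve is computed within one modality at a time, a video-event curve and a music-event curve live on the same scale and can be swapped at inference, which is what makes the zero-pair substitution strategy possible.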
Problem

Research questions and friction points this paper is trying to address.

video-to-music generation
temporal alignment
zero-pair learning
event synchronization
cross-modal generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

zero-pair learning
temporal alignment
event curves
video-to-music generation
intra-modal similarity