🤖 AI Summary
This study addresses the challenge of aligning background music with a video's emotional content and temporal boundaries. It proposes EMSYNC, a video-driven symbolic MIDI composition framework built as a two-stage architecture: a pre-trained video emotion classifier first extracts discrete emotion categories, which a dedicated mapping scheme converts into the continuous valence-arousal inputs expected by the conditional MIDI generator. The generator retains an event-based MIDI encoding for fine-grained timing control and introduces boundary offsets, a novel temporal conditioning mechanism that lets the model anticipate scene cuts and align chord changes with them. In subjective listening tests, the method outperforms state-of-the-art approaches across all subjective metrics, achieving the highest mean opinion scores among both music-theory-aware participants and general listeners, and improving the emotional congruence and temporal coherence of generated soundtracks.
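As a rough sketch of how the discrete-to-continuous emotion mapping might look (the anchor coordinates, category names, and function below are illustrative assumptions, not taken from the paper), each discrete emotion category can be assigned an anchor point in valence-arousal space, with the classifier's output probabilities used as weights:

```python
# Hypothetical anchor points in valence-arousal space for each discrete
# emotion category (coordinates are illustrative, not from the paper).
VA_ANCHORS = {
    "happy": ( 0.8,  0.6),
    "sad":   (-0.7, -0.5),
    "angry": (-0.6,  0.8),
    "calm":  ( 0.5, -0.6),
}

def emotions_to_valence_arousal(probs: dict[str, float]) -> tuple[float, float]:
    """Map a discrete emotion distribution to one continuous
    (valence, arousal) point via a probability-weighted average
    of the category anchors."""
    valence = sum(p * VA_ANCHORS[e][0] for e, p in probs.items())
    arousal = sum(p * VA_ANCHORS[e][1] for e, p in probs.items())
    return valence, arousal

# Example: a mostly happy, slightly calm video clip.
print(emotions_to_valence_arousal({"happy": 0.7, "calm": 0.2, "sad": 0.1}))
```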
📝 Abstract
We introduce EMSYNC, a video-based symbolic music generation model that aligns music with a video's emotional content and temporal boundaries. It follows a two-stage framework, where a pretrained video emotion classifier extracts emotional features, and a conditional music generator produces MIDI sequences guided by both emotional and temporal cues. We introduce boundary offsets, a novel temporal conditioning mechanism that enables the model to anticipate and align musical chords with scene cuts. Unlike existing models, our approach retains event-based encoding, ensuring fine-grained timing control and expressive musical nuances. We also propose a mapping scheme to bridge the video emotion classifier, which produces discrete emotion categories, with the emotion-conditioned MIDI generator, which operates on continuous-valued valence-arousal inputs. In subjective listening tests, EMSYNC outperforms state-of-the-art models across all subjective metrics, for both music-theory-aware participants and general listeners.
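To make the boundary-offset idea concrete, here is a minimal sketch, assuming scene cuts and musical events are given as timestamps in seconds; the function name and offset scheme are illustrative assumptions, not the paper's exact encoding:

```python
def boundary_offsets(event_times, cut_times):
    """For each symbolic-music event time (seconds), return the time
    remaining until the next scene cut, so a generator conditioned on
    these values can anticipate upcoming cuts and place chord changes
    on them. Illustrative scheme, not the paper's exact encoding."""
    offsets = []
    for t in event_times:
        future = [c - t for c in cut_times if c >= t]
        offsets.append(min(future) if future else None)  # None: no cut ahead
    return offsets

# Example: cuts at 4.0 s and 9.5 s; musical events every 2 s.
print(boundary_offsets([0.0, 2.0, 4.0, 6.0, 8.0], [4.0, 9.5]))
# -> [4.0, 2.0, 0.0, 3.5, 1.5]
```

An offset of 0.0 marks an event landing exactly on a cut; shrinking offsets signal an approaching cut, which is what allows anticipation rather than mere reaction.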