🤖 AI Summary
This work addresses the challenges of computational scalability, temporal coherence, and narrative semantic understanding in automatic music generation for long videos. The authors propose a hierarchical generation framework that leverages emotion as a dense representation of narrative logic. Specifically, a frozen vision-language model serves as a continuous emotion sensor to extract valence-arousal trajectories. A dual-branch injection mechanism is introduced: global semantic anchors govern overall musical style, while token-level emotion adapters modulate local dynamics. This approach enables, for the first time, end-to-end, fully automatic music generation for long-form videos with minimal additional computational overhead. Experimental results demonstrate state-of-the-art performance in both musical consistency and narrative alignment.
📝 Abstract
Synthesizing coherent soundtracks for long-form videos remains a formidable challenge, currently stalled by three critical impediments: computational scalability, temporal coherence, and, most critically, a pervasive semantic blindness to evolving narrative logic. To bridge these gaps, we propose NarraScore, a hierarchical framework predicated on the core insight that emotion serves as a high-density compression of narrative logic. Uniquely, we repurpose frozen Vision-Language Models (VLMs) as continuous affective sensors, distilling high-dimensional visual streams into dense, narrative-aware Valence-Arousal trajectories. Mechanistically, NarraScore employs a Dual-Branch Injection strategy to reconcile global structure with local dynamism: a *Global Semantic Anchor* ensures stylistic stability, while a surgical *Token-Level Affective Adapter* modulates local tension via direct element-wise residual injection. This minimalist design bypasses the bottlenecks of dense attention and architectural cloning, effectively mitigating the overfitting risks associated with data scarcity. Experiments demonstrate that NarraScore achieves state-of-the-art consistency and narrative alignment with negligible computational overhead, establishing a fully autonomous paradigm for long-video soundtrack generation.
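The dual-branch injection the abstract describes can be sketched in a few lines of NumPy. Everything below is illustrative: the shapes, the bottleneck adapter, and the use of the trajectory mean as the global anchor are assumptions for exposition, not the authors' actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D = 16, 64   # music-token sequence length, backbone hidden size (assumed)
E = 2           # emotion dimensions: (valence, arousal)

# Hypothetical inputs: frozen music-decoder hidden states and the
# VLM-derived valence-arousal trajectory (one emotion vector per token).
hidden = rng.normal(size=(T, D))
va_traj = rng.uniform(-1.0, 1.0, size=(T, E))

# Branch 1: Global Semantic Anchor — a single style vector (here, the
# trajectory mean projected to hidden size) broadcast over all tokens.
W_anchor = rng.normal(size=(E, D)) * 0.1
anchor = va_traj.mean(axis=0) @ W_anchor              # shape (D,)

# Branch 2: Token-Level Affective Adapter — a small bottleneck MLP
# mapping each (valence, arousal) pair to a per-token residual.
W_down = rng.normal(size=(E, 8)) * 0.1
W_up = rng.normal(size=(8, D)) * 0.1
residual = np.maximum(va_traj @ W_down, 0.0) @ W_up   # shape (T, D)

# Element-wise residual injection: both branches are simply added to the
# hidden states — no dense cross-attention, no cloned backbone layers.
out = hidden + anchor + residual
```

The point of the sketch is the cost profile: conditioning reduces to two additions per token, so the overhead is linear in sequence length and the frozen backbone is never modified.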