🤖 AI Summary
This work addresses the high computational cost and cross-shot consistency challenges of long-form music video generation by proposing a global planning framework formulated as a Multiple-Choice Knapsack Problem (MCKP). The approach constructs a structured persistent state comprising character entities, scene priors, and a sharing graph, and introduces a visual prefix reuse strategy driven by beat repetition that maintains rhythmic coherence while substantially reducing computation. By integrating multimodal saliency estimation, dynamic programming optimization, and a hierarchical forking-and-reuse mechanism, the method achieves an optimal trade-off between perceptual quality and resource consumption under strict budgetary and rhythmic constraints, as quantified by the Cost-Quality Ratio (CQR) metric.
📝 Abstract
Generating long-horizon music videos (MVs) is frequently constrained by prohibitive computational costs and by the difficulty of maintaining cross-shot consistency. We propose AllocMV, a hierarchical framework that formulates music video synthesis as a Multiple-Choice Knapsack Problem (MCKP). AllocMV represents the video's persistent state as a compact, structured object comprising character entities, scene priors, and sharing graphs, which a global planner produces before realization. Using segment saliency estimated from multimodal cues, a group-level MCKP solver based on dynamic programming optimally allocates resources across High-Gen, Mid-Gen, and Reuse branches. For repetitive musical motifs, we implement a divergence-based forking strategy that reuses visual prefixes to reduce costs while ensuring motif-level continuity. Evaluated via the Cost-Quality Ratio (CQR), AllocMV achieves an optimal trade-off between perceived quality and resource expenditure under strict budgetary and rhythmic constraints.
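To make the allocation step concrete, the following is a minimal sketch of a textbook MCKP dynamic program over the three branches named above. It is not AllocMV's actual solver: the branch costs and quality factors in `BRANCHES`, the value model (saliency times quality factor), the integer budget, and the function name `solve_mckp` are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical branch table: (name, integer cost, quality factor).
# These numbers are illustrative assumptions, not values from the paper.
BRANCHES = [("High-Gen", 5, 1.0), ("Mid-Gen", 3, 0.7), ("Reuse", 1, 0.3)]


def solve_mckp(saliency: List[float], budget: int) -> Optional[Tuple[float, List[str]]]:
    """Pick exactly one branch per segment, maximizing total value within budget.

    The value of a choice is assumed to be saliency * quality factor.
    Returns (total value, branch name per segment), or None if infeasible.
    """
    neg = float("-inf")
    # best[b] = max value over the segments processed so far at total cost exactly b
    best = [neg] * (budget + 1)
    best[0] = 0.0
    picks: List[List[int]] = []  # picks[i][b] = branch index used to reach cost b

    for s in saliency:
        nxt = [neg] * (budget + 1)
        pick = [-1] * (budget + 1)
        for b in range(budget + 1):
            if best[b] == neg:
                continue  # cost b is unreachable with the previous segments
            for k, (_, cost, quality) in enumerate(BRANCHES):
                nb = b + cost
                if nb <= budget and best[b] + s * quality > nxt[nb]:
                    nxt[nb] = best[b] + s * quality
                    pick[nb] = k
        best = nxt
        picks.append(pick)

    end = max(range(budget + 1), key=lambda b: best[b])
    if best[end] == neg:
        return None  # even the cheapest branch everywhere exceeds the budget

    # Walk the recorded picks backwards to recover the per-segment plan.
    total, plan, b = best[end], [], end
    for pick in reversed(picks):
        k = pick[b]
        plan.append(BRANCHES[k][0])
        b -= BRANCHES[k][1]
    plan.reverse()
    return total, plan
```

Under these assumed costs, calling `solve_mckp([0.9, 0.2, 0.6], budget=9)` spends the budget on the most salient segment first and falls back to Reuse for the least salient one. The table has one row of size `budget + 1` per segment, so the running time is proportional to the number of segments times the budget times the number of branches.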