🤖 AI Summary
Existing methods for automatic movie trailer generation align music and shots too rigidly and fail to reproduce the elastic rhythm of professional editing. This work proposes BEAT, a five-phase agentic pipeline that combines MuVA, a cross-modal music-visual alignment encoder trained in two stages with Sinkhorn regularization, and Bar-DP, an energy-adaptive dynamic programming algorithm that produces music-driven, elastic many-to-one shot alignments. Structured text signals coordinate the higher-level creative decisions. On the newly introduced TrailerArena benchmark, the system generates trailers end-to-end and achieves state-of-the-art performance in shot selection, ordering, and perceptual quality.
📝 Abstract
Automatic movie trailer generation must select shots from a full-length film and synchronize them with background music. Existing methods either relegate music alignment to post-processing or enforce rigid one-to-one shot-music mappings, overlooking that professional editing rhythm is elastic: rapid cuts accompany high-energy passages while sustained shots span quieter bars. We introduce BEAT, a framework that addresses this gap with two core components: MuVA, a compact music-visual alignment encoder trained with Sinkhorn-regularized two-stage learning, and Bar-DP, an energy-adaptive dynamic programming algorithm that produces elastic many-to-one alignments following musical dynamics. These components are integrated into a five-phase agentic pipeline that grounds the core alignment in learned cross-modal features while coordinating higher-level creative decisions through structured text signals. To support comprehensive evaluation, we also introduce TrailerArena, a benchmark with 20+ metrics across four complementary dimensions. On TrailerArena, BEAT achieves state-of-the-art performance across shot selection, ordering, and perceptual quality, while producing fully composed trailers end-to-end.
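The abstract does not specify how Bar-DP is formulated, so the following is only an illustrative sketch of what an energy-adaptive dynamic program over bars could look like, not the authors' algorithm. It assumes a precomputed bar-to-shot similarity matrix (standing in for MuVA features) and a per-bar energy value in [0, 1]. The function name `elastic_align` and the two penalty parameters are hypothetical. Holding a shot into a high-energy bar and cutting into a low-energy bar are both penalized, so one shot can span several quiet bars while cuts cluster in energetic passages. For simplicity the sketch does not prevent a shot from being reused later.

```python
import numpy as np


def elastic_align(sim, energy, hold_cost=1.0, cut_cost=1.0):
    """Assign one shot to each bar with a Viterbi-style dynamic program.

    sim:    (n_bars, n_shots) similarity scores, higher is better (assumed input).
    energy: (n_bars,) per-bar musical energy in [0, 1] (assumed input).
    Returns a list with the chosen shot index for every bar.
    """
    sim = np.asarray(sim, dtype=float)
    energy = np.asarray(energy, dtype=float)
    n_bars, n_shots = sim.shape

    score = np.empty((n_bars, n_shots))          # best total score ending at (bar, shot)
    back = np.zeros((n_bars, n_shots), dtype=int)  # best predecessor shot
    score[0] = sim[0]
    same = np.eye(n_shots, dtype=bool)

    for i in range(1, n_bars):
        # trans[p, s]: penalty for going from shot p at bar i-1 to shot s at bar i.
        # Staying on a shot costs more when bar i is energetic;
        # cutting costs more when bar i is quiet.
        trans = np.where(same, hold_cost * energy[i], cut_cost * (1.0 - energy[i]))
        cand = score[i - 1][:, None] - trans
        back[i] = cand.argmax(axis=0)
        score[i] = cand.max(axis=0) + sim[i]

    # Trace the best path backwards from the final bar.
    path = [int(score[-1].argmax())]
    for i in range(n_bars - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    return path[::-1]


# Two quiet bars followed by two energetic bars, three candidate shots.
sim = [
    [1.0, 0.0, 0.0],
    [0.6, 0.7, 0.0],
    [0.9, 0.5, 0.0],
    [0.0, 0.5, 0.4],
]
energy = [0.0, 0.0, 1.0, 1.0]
print(elastic_align(sim, energy))  # [0, 0, 1, 2]
```

In this toy input, shot 0 is held across both quiet bars even though shot 1 scores slightly higher on the second bar, and the path then cuts on each energetic bar. That is the many-to-one, energy-dependent behavior the abstract describes, reproduced here under the stated assumptions only.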