🤖 AI Summary
This work addresses the problem of automatically generating cinematically edited videos from static wide-angle footage. Methodologically, we propose an end-to-end framework: (1) synthesizing candidate shot streams ("rushes") via multi-virtual-camera rendering; (2) jointly leveraging large language model (LLM)-driven dialogue understanding and visual saliency prediction for semantic- and perception-aware shot evaluation; and (3) formulating shot selection as an energy minimization problem in which cinematic grammar constraints—e.g., match cuts and gaze continuity—govern shot choices, transitions, and continuity. Our key contributions are twofold: first, the novel integration of LLM-based dialogue modeling with visual saliency for shot selection; second, the explicit encoding of cinematic grammar as an energy function. A psychophysical study (N=20) on the BBC Old School dataset and eleven theatre performance videos demonstrates that our method outperforms competing baselines in perceived quality, narrative coherence, and immersion.
📝 Abstract
We present EditIQ, a fully automated framework for cinematically editing scenes captured with a stationary, large field-of-view, high-resolution camera. From the static camera feed, EditIQ first generates multiple virtual feeds, emulating a team of cameramen. These virtual camera shots, termed rushes, are subsequently assembled by an automated editing algorithm whose objective is to present the viewer with the most vivid scene content. To understand key scene elements and guide the editing process, we employ a two-pronged approach: (1) a large language model (LLM)-based dialogue understanding module that analyzes conversational flow, coupled with (2) visual saliency prediction that identifies meaningful scene elements and the camera shots capturing them. We then formulate cinematic video editing as an energy minimization problem over shot selection, where cinematic constraints determine shot choices, transitions, and continuity. EditIQ synthesizes an aesthetically and visually compelling representation of the original narrative while maintaining cinematic coherence and a smooth viewing experience. The efficacy of EditIQ against competing baselines is demonstrated via a psychophysical study involving twenty participants on the BBC Old School dataset and eleven theatre performance videos. Video samples from EditIQ can be found at https://editiq-ave.github.io/.
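To make the shot-selection formulation concrete, the kind of energy minimization described above can be sketched as a Viterbi-style dynamic program over per-frame shot costs (e.g., penalizing low saliency) plus pairwise transition penalties (e.g., discouraging jarring cuts). This is a minimal illustrative sketch, not the paper's actual formulation; all function names, cost tables, and the DP solver itself are assumptions for illustration.

```python
# Hypothetical sketch: shot-sequence selection as energy minimization,
# solved with Viterbi-style dynamic programming. Costs are illustrative
# stand-ins for saliency/dialogue scores and cinematic cut penalties.

def select_shots(unary, transition):
    """unary[t][s]: cost of showing shot s at time step t.
    transition[p][s]: penalty for cutting from shot p to shot s.
    Returns the minimum-total-energy shot index sequence."""
    T, S = len(unary), len(unary[0])
    cost = [unary[0][:]]   # cost[t][s]: best energy ending in shot s at time t
    back = []              # backpointers for recovering the optimal path
    for t in range(1, T):
        row, ptr = [], []
        for s in range(S):
            best_prev = min(range(S), key=lambda p: cost[-1][p] + transition[p][s])
            row.append(cost[-1][best_prev] + transition[best_prev][s] + unary[t][s])
            ptr.append(best_prev)
        cost.append(row)
        back.append(ptr)
    # Trace back from the cheapest final shot
    seq = [min(range(S), key=lambda s: cost[-1][s])]
    for ptr in reversed(back):
        seq.append(ptr[seq[-1]])
    return seq[::-1]
```

With a heavy transition penalty the optimizer holds one shot even when another is momentarily cheaper; with a light penalty it cuts to track the per-frame costs, mirroring the trade-off between content coverage and editing smoothness.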