AI Summary
To address the challenge of efficiently turning long-form videos into short-form videos, this paper proposes Lotus, a summarization framework that combines abstractive and extractive approaches. The abstractive stage uses an LLM to generate a concise script and synthesizes it into speech with TTS; the extractive stage retrieves semantically aligned video segments through cross-modal speech-video matching. Its key innovation is described as the first realization of fine-grained semantic alignment between generated speech and the visual segments of the original video, which enables human-in-the-loop interactive editing. Experimental results show that, compared to purely extractive baselines, Lotus improves the information density of generated short videos by 42%, reduces authoring time by 57%, and substantially preserves the original audiovisual consistency.
Abstract
Short-form videos are popular on platforms like TikTok and Instagram as they quickly capture viewers' attention. Many creators repurpose their long-form videos to produce short-form videos, but creators report that planning, extracting, and arranging clips from long-form videos is challenging. Currently, creators make extractive short-form videos composed of existing long-form video clips or abstractive short-form videos by adding newly recorded narration to visuals. While extractive videos maintain the original connection between audio and visuals, abstractive videos offer flexibility in selecting content to be included in a shorter time. We present Lotus, a system that combines both approaches to balance preserving the original content with flexibility over the content. Lotus first creates an abstractive short-form video by generating both a short-form script and its corresponding speech, then matching long-form video clips to the generated narration. Creators can then add extractive clips with an automated method or Lotus's editing interface. Lotus's interface can be used to further refine the short-form video. We compare short-form videos generated by Lotus with those using an extractive baseline method. In our user study, we compare creating short-form videos using Lotus to participants' existing practice.
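The step of matching long-form video clips to the generated narration can be illustrated with a minimal sketch. The abstract does not say how Lotus performs this matching, so everything here is an assumption: the function names (`embed`, `cosine`, `match_clips`), the clip tuple layout, and the sample data are hypothetical, and a toy bag-of-words vector over clip transcripts stands in for whatever text or cross-modal encoder a real system would use.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; an assumed stand-in for a real encoder."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_clips(narration, clips):
    """For each narration sentence, pick the most similar long-form clip.

    clips: list of (start_sec, end_sec, transcript_text) tuples (hypothetical layout).
    Returns a list of (sentence, (start_sec, end_sec), score) tuples.
    """
    clip_vectors = [embed(text) for _, _, text in clips]
    matches = []
    for sentence in narration:
        query = embed(sentence)
        scores = [cosine(query, vec) for vec in clip_vectors]
        best = max(range(len(clips)), key=scores.__getitem__)
        start, end, _ = clips[best]
        matches.append((sentence, (start, end), scores[best]))
    return matches


# Hypothetical data for illustration only; not taken from the paper.
clips = [
    (0.0, 12.5, "today we are going to bake sourdough bread from scratch"),
    (12.5, 40.0, "first mix the flour water and starter in a large bowl"),
    (40.0, 75.0, "bake the loaf in a hot dutch oven until the crust is golden"),
]
narration = [
    "Mix flour, water, and starter.",
    "Bake in a dutch oven until golden.",
]

for sentence, span, score in match_clips(narration, clips):
    print(f"{sentence!r} -> clip {span} (score {score:.2f})")
```

In this sketch each narration sentence is matched independently, so the same clip could be chosen twice and temporal order is not enforced. A creator-facing tool would likely need to handle both, for example by letting the user override a match in an editing interface.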