🤖 AI Summary
This work addresses the time-consuming and repetitive nature of manual video editing by proposing the first multi-agent framework tailored for automatically editing long-form video into short summaries. The approach leverages collaborative multimodal large language models (MLLMs) to hierarchically decompose input videos into semantic structures, while a dedicated script agent orchestrates narrative coherence. Concurrently, editing and review agents jointly optimize visual content selection, ensuring both global alignment with musical beats and fine-grained fidelity to semantic storytelling. Experimental results demonstrate that the generated short videos significantly outperform those of state-of-the-art methods in rhythmic synchronization, visual aesthetics, and semantic coherence.
📝 Abstract
Editing video content to align with audio has become a digital art form on today's social media. However, the time-consuming and repetitive nature of manual video editing has long been a challenge for filmmakers and professional content creators alike. In this paper, we introduce CutClaw, an autonomous multi-agent framework that leverages multiple Multimodal Large Language Models (MLLMs) to edit hours-long raw footage into meaningful short videos. It produces visually appealing videos that are synchronized with music and follow user instructions. Specifically, our approach begins with a hierarchical multimodal decomposition that captures both fine-grained details and global structures across the visual and audio footage. Then, to ensure narrative consistency, a Playwriter Agent orchestrates the overall storytelling flow and structures the long-term narrative, anchoring visual scenes to musical shifts. Finally, to construct the short edited video, Editor and Reviewer Agents collaboratively optimize the final cut by selecting fine-grained visual content according to rigorous aesthetic and semantic criteria. We conduct detailed experiments demonstrating that CutClaw significantly outperforms state-of-the-art baselines in generating high-quality, rhythm-aligned videos. The code is available at: https://github.com/GVCLab/CutClaw.