CutClaw: Agentic Hours-Long Video Editing via Music Synchronization

📅 2026-03-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the time-consuming and repetitive nature of manual video editing by proposing the first multi-agent framework tailored for automatic long-form video summarization. The approach leverages collaborative multimodal large language models (MLLMs) to hierarchically decompose input videos into semantic structures, while a dedicated script agent orchestrates narrative coherence. Concurrently, editing and review agents jointly optimize visual content selection, ensuring both global alignment with musical beats and fine-grained fidelity to semantic storytelling. Experimental results demonstrate that the generated short videos significantly outperform state-of-the-art methods in terms of rhythmic synchronization, visual aesthetics, and semantic coherence.
📝 Abstract
Editing video content in alignment with music has become a form of digital art on today's social media. However, the time-consuming and repetitive nature of manual video editing has long been a challenge for filmmakers and professional content creators alike. In this paper, we introduce CutClaw, an autonomous multi-agent framework that leverages multiple Multimodal Language Models (MLLMs) to edit hours-long raw footage into meaningful short videos. It produces videos that are synchronized with music, follow the given instructions, and have a visually appealing appearance. In detail, our approach begins with a hierarchical multimodal decomposition that captures both fine-grained details and global structures across the visual and audio footage. Then, to ensure narrative consistency, a Playwriter Agent orchestrates the overall storytelling flow and structures the long-term narrative, anchoring visual scenes to musical shifts. Finally, to construct the short edited video, Editor and Reviewer Agents collaboratively optimize the final cut by selecting fine-grained visual content against rigorous aesthetic and semantic criteria. Detailed experiments demonstrate that CutClaw significantly outperforms state-of-the-art baselines in generating high-quality, rhythm-aligned videos. The code is available at: https://github.com/GVCLab/CutClaw.
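The abstract describes a three-stage pipeline: decomposed scenes are anchored to musical beats by a Playwriter Agent, then an Editor Agent selects clips and a Reviewer Agent checks the cut. A minimal sketch of that control flow, assuming the paper's agents can be abstracted as plain functions (all class and function names here are illustrative, not from the paper or its code release):

```python
# Hypothetical sketch of CutClaw's agent pipeline; in the real system each
# step would call an MLLM, here they are deterministic stand-ins.
from dataclasses import dataclass


@dataclass
class Scene:
    start: float      # seconds into the raw footage
    end: float
    caption: str      # semantic description (MLLM-generated in the paper)
    aesthetic: float  # visual-quality score in [0, 1]


def playwright_agent(scenes, beat_times):
    """Anchor one scene to each musical beat, forming a narrative skeleton."""
    ordered = sorted(scenes, key=lambda s: s.start)
    return list(zip(beat_times, ordered))


def editor_agent(storyboard, min_aesthetic=0.5):
    """Keep only beat-anchored clips that meet the aesthetic threshold."""
    return [(beat, s) for beat, s in storyboard if s.aesthetic >= min_aesthetic]


def reviewer_agent(cut, beat_times):
    """Accept the cut only if every retained clip is still on a beat."""
    beats = set(beat_times)
    return all(beat in beats for beat, _ in cut)


scenes = [
    Scene(0, 4, "opening shot", 0.9),
    Scene(4, 9, "blurry pan", 0.3),
    Scene(9, 15, "closing shot", 0.8),
]
beats = [0.0, 2.5, 5.0]

cut = editor_agent(playwright_agent(scenes, beats))
print(len(cut), reviewer_agent(cut, beats))  # → 2 True
```

In the paper the Editor and Reviewer iterate collaboratively; the single pass above only illustrates the data flow between the three roles.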
Problem

Research questions and friction points this paper is trying to address.

video editing
music synchronization
long-form video
autonomous editing
multimodal alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic Video Editing
Music Synchronization
Multimodal Language Models
Hierarchical Multimodal Decomposition
Autonomous Multi-Agent Framework