🤖 AI Summary
This work addresses the automation of nonlinear video editing driven by natural language instructions. We propose a dual-agent collaborative framework: an Editor agent performs editing operations, while a Critic agent—built upon a large language model (LLM)—provides human-preference-aligned, closed-loop evaluation feedback. Editing is formulated as a sequential decision-making task, integrating LLMs’ semantic understanding and tool-use capabilities with reinforcement learning principles. A learned multi-agent communication mechanism is designed to emulate professional editing behavior. Compared to prior approaches, our method achieves significant improvements in clip coverage, temporal constraint satisfaction, and user preference scores. A user study confirms that generated video sequences are both high-quality and practically useful. Our core contribution is the first instantiation of a language–action–evaluation closed-loop paradigm for multi-agent video editing, and the empirical validation of LLMs as learnable, preference-aware evaluators.
📝 Abstract
Automated tools for video editing and assembly have applications ranging from filmmaking and advertising to content creation for social media. Previous video editing work has mainly focused on either retrieval or user interfaces, leaving the actual editing to the user. In contrast, we propose to automate the core task of video editing, formulating it as a sequential decision-making process. Ours is a multi-agent approach: we design an Editor agent and a Critic agent. The Editor takes as input a collection of video clips together with natural language instructions and uses tools commonly found in video editing software to produce an edited sequence. The Critic, in turn, gives natural language feedback to the Editor based on the produced sequence, or renders it if the result is satisfactory. We introduce a learning-based approach for enabling effective communication across specialized agents to address the language-driven video editing task. Finally, we explore an LLM-as-a-judge metric for evaluating the quality of video editing systems and compare it with general human preference. We evaluate our system's output video sequences qualitatively and quantitatively through a user study and find that our system vastly outperforms existing approaches in terms of coverage, time constraint satisfaction, and human preference.
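The Editor–Critic closed loop described in the abstract can be sketched in a few lines. This is a minimal, hypothetical illustration only: the function names (`editor_step`, `critic_step`, `edit_loop`), the duration-based satisfaction check, and the trimming heuristic are all assumptions for the sketch, not the paper's actual implementation, in which both roles are backed by LLMs and real editing tools.

```python
# Hypothetical sketch of the Editor–Critic closed loop (not the paper's code).
# Clips are represented only by their durations in seconds.

def editor_step(clips, feedback):
    """Stub Editor: trims each clip, cutting one more second per round of
    Critic feedback received. (The real Editor invokes editing tools via an LLM.)"""
    trim = len(feedback)
    return [max(1, duration - trim) for duration in clips]

def critic_step(sequence, max_duration):
    """Stub Critic: approves the sequence if it satisfies the time constraint,
    otherwise returns natural-language feedback. (The real Critic is an LLM judge.)"""
    total = sum(sequence)
    if total <= max_duration:
        return None  # satisfied: render the sequence
    return f"Sequence runs {total}s but must fit in {max_duration}s; tighten the cuts."

def edit_loop(clips, max_duration, max_rounds=10):
    """Closed loop: the Editor proposes a cut, the Critic either approves it
    or sends feedback that conditions the Editor's next attempt."""
    feedback = []
    sequence = clips
    for _ in range(max_rounds):
        sequence = editor_step(clips, feedback)
        message = critic_step(sequence, max_duration)
        if message is None:
            return sequence  # Critic satisfied: render
        feedback.append(message)
    return sequence  # fall back to the last attempt after max_rounds

final = edit_loop(clips=[5, 4, 6], max_duration=10)  # → [3, 2, 4]
```

The point of the sketch is the control flow: editing is a sequential decision process, and the Critic's natural-language feedback, rather than a scalar reward, closes the loop.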