🤖 AI Summary
This work addresses the challenges of high computational cost and difficulty in maintaining global temporal consistency across thousands of frames in minute-scale long video editing. The authors propose a training-free, divide-and-conquer optical flow framework that segments the video for localized editing. To mitigate boundary flickering between adjacent segments, a Velocity Blend module fuses motion information from neighboring clips. Furthermore, an Attention Sink mechanism anchors global reference features to effectively suppress structural drift. Experimental results demonstrate that the proposed method significantly outperforms existing approaches in both temporal stability and semantic fidelity, enabling efficient and high-quality editing of long videos.
📝 Abstract
We propose MLV-Edit, a training-free, flow-based framework that addresses the unique challenges of minute-level video editing. While existing techniques excel at short-form video manipulation, scaling them to long-duration videos remains challenging due to prohibitive computational overhead and the difficulty of maintaining global temporal consistency across thousands of frames. To address this, MLV-Edit employs a divide-and-conquer strategy for segment-wise editing, facilitated by two core modules: Velocity Blend rectifies motion inconsistencies at segment boundaries by aligning the flow fields of adjacent chunks, eliminating the flickering and boundary artifacts commonly observed in fragmented video processing; and Attention Sink anchors local segment features to global reference frames, effectively suppressing cumulative structural drift. Extensive quantitative and qualitative experiments demonstrate that MLV-Edit consistently outperforms state-of-the-art methods in temporal stability and semantic fidelity.
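To make the boundary-alignment idea concrete, here is a minimal illustrative sketch (not the authors' code) of blending the flow (velocity) fields of two adjacent segments over a shared overlap window, in the spirit of the Velocity Blend module. The function name, array layout, and linear blending schedule are all assumptions for illustration.

```python
import numpy as np

def velocity_blend(flow_prev, flow_next):
    """Blend two stacks of per-frame flow fields of shape (T, H, W, 2)
    covering the same overlap window, ramping linearly from the previous
    segment's motion to the next segment's. Hypothetical helper; the
    paper's actual blending rule may differ."""
    assert flow_prev.shape == flow_next.shape
    T = flow_prev.shape[0]
    # Linear weights: frame 0 fully trusts the previous segment,
    # frame T-1 fully trusts the next one.
    w = np.linspace(0.0, 1.0, T).reshape(T, 1, 1, 1)
    return (1.0 - w) * flow_prev + w * flow_next

# Toy usage: two constant flow fields over a 5-frame overlap window.
prev = np.zeros((5, 4, 4, 2)); prev[..., 0] = 1.0  # uniform rightward motion
nxt = np.zeros((5, 4, 4, 2)); nxt[..., 1] = 1.0    # uniform downward motion
blended = velocity_blend(prev, nxt)
```

A smooth ramp like this avoids the hard cut in motion that would otherwise appear where one independently edited segment hands off to the next.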