🤖 AI Summary
Training-free methods for text-driven video editing suffer from temporal inconsistency and structural distortion because they invert videos into latent space. This paper introduces the first ODE-flow-guided editing paradigm that operates directly in data space, bypassing video inversion entirely, and models video evolution as a continuous dynamical process. Key contributions include: (1) an attention-modulated velocity-field control mechanism for precise local motion modeling; (2) a semantic alignment guidance strategy based on differential signals that strengthens instruction-content consistency; and (3) a novel Classifier-Free Guidance variant that integrates attention-guided masks with enhanced flow-based guidance. The method significantly improves instruction adherence, temporal coherence, and structural fidelity while preserving background content, achieving state-of-the-art performance across multiple quantitative and qualitative metrics.
📝 Abstract
Text-driven video editing aims to modify video content according to natural language instructions. While recent training-free approaches have made progress by leveraging pre-trained diffusion models, they typically rely on inversion-based techniques that map input videos into the latent space, which often leads to temporal inconsistencies and degraded structural fidelity. To address this, we propose FlowDirector, a novel inversion-free video editing framework. Our framework models the editing process as a direct evolution in data space, guiding the video via an Ordinary Differential Equation (ODE) to smoothly transition along its inherent spatiotemporal manifold, thereby preserving temporal coherence and structural details. To achieve localized and controllable edits, we introduce an attention-guided masking mechanism that modulates the ODE velocity field, preserving non-target regions both spatially and temporally. Furthermore, to address incomplete edits and enhance semantic alignment with editing instructions, we present a guidance-enhanced editing strategy inspired by Classifier-Free Guidance, which leverages differential signals between multiple candidate flows to steer the editing trajectory toward stronger semantic alignment without compromising structural consistency. Extensive experiments across benchmarks demonstrate that FlowDirector achieves state-of-the-art performance in instruction adherence, temporal consistency, and background preservation, establishing a new paradigm for efficient and coherent video editing without inversion.
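The core mechanics described above, an ODE step in data space, an attention-derived mask that gates the velocity field, and a CFG-style differential signal between candidate flows, can be illustrated with a toy sketch. Everything here is an assumption for illustration: `edit_step`, the tensor shapes, and the random stand-ins for model-predicted flows are hypothetical, not FlowDirector's actual implementation.

```python
import numpy as np

def edit_step(z, v_src, v_tgt, mask, guidance_scale, dt):
    """One Euler step of a hypothetical data-space editing ODE.

    z              : current video frames, shape (T, H, W, C)
    v_src, v_tgt   : candidate velocity fields (source / target prompt)
    mask           : attention-derived edit mask in [0, 1], broadcastable to z
    guidance_scale : CFG-style weight amplifying the edit direction
    dt             : integration step size
    """
    # CFG-style differential signal: amplify the target-minus-source direction.
    v = v_src + guidance_scale * (v_tgt - v_src)
    # Mask modulates the velocity field so non-target regions stay untouched.
    v = mask * v
    # Euler update along the guided flow, directly in data space (no inversion).
    return z + dt * v

# Toy demonstration with random stand-ins for model outputs.
rng = np.random.default_rng(0)
T, H, W, C = 4, 8, 8, 3
z = rng.normal(size=(T, H, W, C))
v_src = rng.normal(size=z.shape)
v_tgt = rng.normal(size=z.shape)
mask = np.zeros((T, H, W, 1))
mask[:, 2:6, 2:6, :] = 1.0  # edit only a central spatial patch

z_next = edit_step(z, v_src, v_tgt, mask, guidance_scale=2.0, dt=0.1)
# Background rows (mask == 0) are exactly preserved after the step.
assert np.allclose(z_next[:, :2], z[:, :2])
```

The zero-masked regions receive zero velocity, so they are preserved exactly at every step, which is one plausible reading of how spatial and temporal background preservation falls out of gating the flow rather than re-synthesizing the whole video.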