FlowDirector: Training-Free Flow Steering for Precise Text-to-Video Editing

📅 2025-06-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Training-free text-driven video editing suffers from temporal inconsistency and structural distortion caused by latent-space video inversion. This paper introduces the first ODE-flow-guided editing paradigm that operates directly in data space, bypassing video inversion entirely and modeling video evolution as a continuous dynamical process. Key contributions include: (1) an attention-modulated velocity-field control mechanism for precise, localized edits; (2) a semantic-alignment guidance strategy based on differential signals that strengthens instruction-content consistency; and (3) a novel Classifier-Free Guidance variant integrating attention-guided masks with enhanced flow-based guidance. While preserving background content, the method significantly improves instruction adherence, temporal coherence, and structural fidelity, achieving state-of-the-art performance across multiple quantitative and qualitative metrics.

📝 Abstract
Text-driven video editing aims to modify video content according to natural language instructions. While recent training-free approaches have made progress by leveraging pre-trained diffusion models, they typically rely on inversion-based techniques that map input videos into the latent space, which often leads to temporal inconsistencies and degraded structural fidelity. To address this, we propose FlowDirector, a novel inversion-free video editing framework. Our framework models the editing process as a direct evolution in data space, guiding the video via an Ordinary Differential Equation (ODE) to smoothly transition along its inherent spatiotemporal manifold, thereby preserving temporal coherence and structural details. To achieve localized and controllable edits, we introduce an attention-guided masking mechanism that modulates the ODE velocity field, preserving non-target regions both spatially and temporally. Furthermore, to address incomplete edits and enhance semantic alignment with editing instructions, we present a guidance-enhanced editing strategy inspired by Classifier-Free Guidance, which leverages differential signals between multiple candidate flows to steer the editing trajectory toward stronger semantic alignment without compromising structural consistency. Extensive experiments across benchmarks demonstrate that FlowDirector achieves state-of-the-art performance in instruction adherence, temporal consistency, and background preservation, establishing a new paradigm for efficient and coherent video editing without inversion.
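The editing process described in the abstract can be sketched as Euler integration of a masked, guidance-amplified velocity field directly in data space. The sketch below is a minimal illustration under assumptions, not the paper's implementation: `edit_step`, `v_src`, `v_tgt`, and `omega` are hypothetical names, and the velocity functions stand in for a pre-trained flow model conditioned on the source and target prompts.

```python
import numpy as np

def edit_step(x, t, dt, v_src, v_tgt, mask, omega=2.0):
    """One Euler step of an inversion-free, flow-guided edit (illustrative).

    x      : current video tensor (here a flat array for simplicity)
    v_src  : callable (x, t) -> velocity under the source prompt
    v_tgt  : callable (x, t) -> velocity under the target (edit) prompt
    mask   : attention-derived weights (1 = edit region, 0 = preserve)
    omega  : CFG-style scale on the differential signal between the flows
    """
    dv = v_tgt(x, t) - v_src(x, t)   # differential signal between candidate flows
    v = omega * dv                   # guidance-enhanced edit direction
    return x + dt * (mask * v)       # masked velocity: non-target regions stay fixed

# Toy demo: constant velocities, mask freezes the last two elements.
x = np.ones(4)
mask = np.array([1.0, 1.0, 0.0, 0.0])
v_src = lambda x, t: np.zeros_like(x)
v_tgt = lambda x, t: np.ones_like(x)
x1 = edit_step(x, 0.0, 0.1, v_src, v_tgt, mask, omega=2.0)
# masked entries move by omega * dt = 0.2; unmasked entries are preserved
```

In the actual method the step would be repeated along the ODE trajectory, with the mask and guidance computed from the model's attention maps rather than fixed, but the key property is visible even in this toy form: regions where the mask is zero are untouched, which is how background preservation falls out of the velocity-field formulation.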
Problem

Research questions and friction points this paper is trying to address.

Addresses temporal inconsistencies in text-driven video editing
Enhances structural fidelity without inversion-based techniques
Improves semantic alignment with localized controllable edits
Innovation

Methods, ideas, or system contributions that make the work stand out.

Inversion-free ODE-based video editing framework
Attention-guided masking for localized edits
Guidance-enhanced editing with differential signals
Guangzhao Li
AGI Lab, Westlake University; Central South University
Yanming Yang
Westlake University
3D Vision
Chenxi Song
Westlake University & Jilin University
3D Vision; 3D & 4D Generation & Reconstruction
Chi Zhang
AGI Lab, Westlake University