🤖 AI Summary
Training-free methods for text-driven video editing suffer from temporal inconsistency and structural distortion because they invert videos into latent space. This paper introduces the first ODE-flow-guided editing paradigm that operates directly in data space, bypassing video inversion entirely, and models video evolution as a continuous dynamical process. Key contributions include: (1) an attention-modulated velocity-field control mechanism for precise local motion modeling; (2) a semantic alignment guidance strategy based on differential signals that strengthens instruction-content consistency; and (3) a novel Classifier-Free Guidance variant that integrates attention-guided masks with enhanced flow-based guidance. The method significantly improves instruction adherence, temporal coherence, and structural fidelity while preserving background content, achieving state-of-the-art performance across multiple quantitative and qualitative metrics.
📝 Abstract
Text-driven video editing aims to modify video content according to natural language instructions. While recent training-free approaches have made progress by leveraging pre-trained diffusion models, they typically rely on inversion-based techniques that map input videos into the latent space, which often leads to temporal inconsistencies and degraded structural fidelity. To address this, we propose FlowDirector, a novel inversion-free video editing framework. Our framework models the editing process as a direct evolution in data space, guiding the video via an Ordinary Differential Equation (ODE) to smoothly transition along its inherent spatiotemporal manifold, thereby preserving temporal coherence and structural details. To achieve localized and controllable edits, we introduce an attention-guided masking mechanism that modulates the ODE velocity field, preserving non-target regions both spatially and temporally. Furthermore, to address incomplete edits and enhance semantic alignment with editing instructions, we present a guidance-enhanced editing strategy inspired by Classifier-Free Guidance, which leverages differential signals between multiple candidate flows to steer the editing trajectory toward stronger semantic alignment without compromising structural consistency. Extensive experiments across benchmarks demonstrate that FlowDirector achieves state-of-the-art performance in instruction adherence, temporal consistency, and background preservation, establishing a new paradigm for efficient and coherent video editing without inversion.
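The core mechanics described above, an ODE step in data space, an attention-derived mask that gates the velocity field, and a CFG-style differential signal between candidate flows, can be illustrated with a toy sketch. Everything here is an assumption for illustration: `edit_step`, the tensor shapes, and the random stand-ins for model-predicted flows are hypothetical, not FlowDirector's actual implementation.

```python
import numpy as np

def edit_step(z, v_src, v_tgt, mask, guidance_scale, dt):
    """One Euler step of a hypothetical data-space editing ODE.

    z              : current video frames, shape (T, H, W, C)
    v_src, v_tgt   : candidate velocity fields (source / target prompt)
    mask           : attention-derived edit mask in [0, 1], broadcastable to z
    guidance_scale : CFG-style weight amplifying the edit direction
    dt             : integration step size
    """
    # CFG-style differential signal: amplify the target-minus-source direction.
    v = v_src + guidance_scale * (v_tgt - v_src)
    # Mask modulates the velocity field so non-target regions stay untouched.
    v = mask * v
    # Euler update along the guided flow, directly in data space (no inversion).
    return z + dt * v

# Toy demonstration with random stand-ins for model outputs.
rng = np.random.default_rng(0)
T, H, W, C = 4, 8, 8, 3
z = rng.normal(size=(T, H, W, C))
v_src = rng.normal(size=z.shape)
v_tgt = rng.normal(size=z.shape)
mask = np.zeros((T, H, W, 1))
mask[:, 2:6, 2:6, :] = 1.0  # edit only a central spatial patch

z_next = edit_step(z, v_src, v_tgt, mask, guidance_scale=2.0, dt=0.1)
# Background rows (mask == 0) are exactly preserved after the step.
assert np.allclose(z_next[:, :2], z[:, :2])
```

The zero-masked regions receive zero velocity, so they are preserved exactly at every step, which is one plausible reading of how spatial and temporal background preservation falls out of gating the flow rather than re-synthesizing the whole video.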