FluencyVE: Marrying Temporal-Aware Mamba with Bypass Attention for Video Editing

📅 2025-12-24
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address temporal inconsistency and high computational cost in video editing, this paper proposes FluencyVE, a simple and efficient one-shot video editing method. Methodologically, it integrates a temporally aware linear state space model (Mamba) into a pre-trained Stable Diffusion framework, replacing costly spatiotemporal attention mechanisms. It is the first to leverage Mamba for modeling global inter-frame dependencies, combining low-rank query/key matrices with a weighted attention-score update strategy to enable lightweight, causal, and long-range temporal modeling. Experimentally, on real-world video editing tasks spanning multiple attributes (subject, scene, and action), FluencyVE significantly improves temporal consistency, achieves a 3.2× speedup in inference latency, reduces GPU memory consumption by 41%, and outperforms existing state-of-the-art methods in editing quality.

πŸ“ Abstract
Large-scale text-to-image diffusion models have achieved unprecedented success in image generation and editing. However, extending this success to video editing remains challenging. Recent video editing efforts have adapted pretrained text-to-image models by adding temporal attention mechanisms to handle video tasks. Unfortunately, these methods continue to suffer from temporal inconsistency and high computational overhead. In this study, we propose FluencyVE, a simple yet effective one-shot video editing approach. FluencyVE integrates Mamba, a linear time-series module, into a video editing model based on pretrained Stable Diffusion, replacing the temporal attention layer. This enables global frame-level attention while reducing computational cost. In addition, we employ low-rank approximation matrices to replace the query and key weight matrices in the causal attention, and use a weighted averaging technique during training to update the attention scores. This approach largely preserves the generative power of the text-to-image model while effectively reducing the computational burden. Experiments and analyses demonstrate promising results in editing various attributes, subjects, and locations in real-world videos.
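The core replacement described above, swapping the temporal attention layer for a linear state-space (Mamba-style) scan over frames, can be illustrated with a minimal diagonal-SSM recurrence: hidden state cost is linear in the number of frames and the scan is causal. This is a hedged NumPy sketch of the general idea, not the paper's implementation; the function name `ssm_temporal_mix` and the fixed (non-selective) parameters `a`, `B`, `C` are illustrative assumptions.

```python
import numpy as np

def ssm_temporal_mix(x, a, B, C):
    """Causal linear state-space mixing over frames, O(T) in frame count.

    x: (T, D) per-frame features; a: (N,) diagonal state decay;
    B: (N, D) input projection; C: (D, N) output projection.
    Recurrence: h_t = a * h_{t-1} + B x_t,  y_t = C h_t.
    """
    T, _ = x.shape
    h = np.zeros(a.shape[0])
    y = np.empty_like(x)
    for t in range(T):
        h = a * h + B @ x[t]   # update hidden state with current frame
        y[t] = C @ h           # read out mixed features for frame t
    return y

# Causality check: perturbing a later frame leaves earlier outputs unchanged.
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
a = np.full(8, 0.9)
B = rng.normal(size=(8, 4))
C = rng.normal(size=(4, 8))
y1 = ssm_temporal_mix(x, a, B, C)
x2 = x.copy()
x2[5] += 1.0
y2 = ssm_temporal_mix(x2, a, B, C)
```

Unlike full spatiotemporal attention, whose score matrix grows quadratically with the number of frames, this recurrence carries a fixed-size state across frames, which is the source of the efficiency gains the abstract claims.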
Problem

Research questions and friction points this paper is trying to address.

Addresses temporal inconsistency in video editing
Reduces computational overhead in video editing models
Enhances video attribute, subject, and location editing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates Mamba for global frame-level attention
Uses low-rank matrices to replace query and key weights
Employs weighted averaging to update attention scores
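The last two innovation points can be sketched together: factor the D×D query/key weight matrices into rank-r products (2·D·r parameters instead of D²), compute causally masked attention scores, and blend old and new scores with a weighted average. This is a minimal NumPy sketch under stated assumptions, not the authors' code; the rank r, the EMA-style blend in `ema_scores`, and the value of `alpha` are illustrative.

```python
import numpy as np

def low_rank_attention_scores(x, Uq, Vq, Uk, Vk):
    """Causal attention weights with low-rank query/key projections.

    x: (T, D) frame features. Uq, Uk: (D, r); Vq, Vk: (r, D), so each
    effective weight matrix Uq @ Vq replaces a full (D, D) matrix.
    """
    q = x @ (Uq @ Vq).T
    k = x @ (Uk @ Vk).T
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # causal mask: frame t may attend only to frames <= t
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ema_scores(prev, new, alpha=0.9):
    # Weighted-average update of attention scores during training
    # (assumed EMA form; alpha is an illustrative hyperparameter).
    return alpha * prev + (1.0 - alpha) * new

rng = np.random.default_rng(1)
T, D, r = 5, 8, 2
x = rng.normal(size=(T, D))
Uq, Vq = rng.normal(size=(D, r)), rng.normal(size=(r, D))
Uk, Vk = rng.normal(size=(D, r)), rng.normal(size=(r, D))
A = low_rank_attention_scores(x, Uq, Vq, Uk, Vk)
```

With r much smaller than D, the projection cost drops from D² to 2·D·r parameters per matrix, while the causal mask keeps the temporal modeling consistent with the one-directional scan described in the summary.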
Mingshu Cai
Waseda University, Japan
Yixuan Li
School of Computer Science and Engineering, Southeast University, China
Osamu Yoshie
Waseda University, Japan
Yuya Ieiri
Waseda University, Japan