VRWKV-Editor: Reducing quadratic complexity in transformer-based video editing

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Standard attention mechanisms in Transformer-based video editing models incur O(N²) computational complexity, posing severe bottlenecks for long videos and high-resolution inputs. To address this, the paper proposes VRWKV-Editor, the first model to integrate RWKV's bidirectional weighted key-value recurrence into a diffusion-based video editing framework via a linear spatio-temporal aggregation module. VRWKV-Editor preserves global spatiotemporal modeling capacity and temporal coherence while reducing complexity to O(N). Experiments demonstrate up to 3.7× speedup and 60% memory reduction over state-of-the-art methods, without compromising frame-to-frame consistency or text–vision alignment quality, maintaining SOTA performance across standard benchmarks. Notably, its advantages become pronounced for sequences exceeding 128 frames, where conventional attention-based approaches suffer from prohibitive computational overhead.

📝 Abstract
In light of recent progress in video editing, deep learning models focusing on both spatial and temporal dependencies have emerged as the primary method. However, these models suffer from the quadratic computational complexity of traditional attention mechanisms, making them difficult to adapt to long-duration and high-resolution videos. This limitation restricts their applicability in practical contexts such as real-time video processing. To tackle this challenge, we introduce a method to reduce both the time and space complexity of these systems by proposing VRWKV-Editor, a novel video editing model that integrates a linear spatio-temporal aggregation module into video-based diffusion models. VRWKV-Editor leverages the bidirectional weighted key-value recurrence mechanism of the RWKV transformer to capture global dependencies while preserving temporal coherence, achieving linear complexity without sacrificing quality. Extensive experiments demonstrate that the proposed method achieves up to 3.7× speedup and 60% lower memory usage compared to state-of-the-art diffusion-based video editing methods, while maintaining competitive performance in frame consistency and text alignment. Furthermore, a comparative analysis we conducted on videos with different sequence lengths confirms that the gap in editing speed between our approach and architectures with self-attention becomes more significant with long videos.
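The complexity contrast at the heart of the abstract can be sketched in a few lines of NumPy. The sketch below is an illustrative toy, not the paper's implementation: it uses a single scalar decay `w`, drops RWKV's per-channel decay, "bonus" term, and log-space numerical stabilization, and approximates bidirectionality by averaging a forward and a backward scan. All function names here are assumptions for illustration. The key point it demonstrates is structural: softmax attention materializes an N×N score matrix (O(N²)), while the weighted key-value recurrence maintains running accumulators and touches each timestep once (O(N)).

```python
import numpy as np

def quadratic_attention(q, k, v):
    # Standard softmax attention: the (N, N) score matrix is the O(N^2) bottleneck.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def linear_wkv(k, v, w=0.5):
    # Simplified RWKV-style weighted key-value recurrence (causal direction only).
    # Running numerator/denominator accumulators give an O(N) scan over the
    # sequence; no N x N matrix is ever formed.
    N, d = v.shape
    out = np.empty_like(v)
    num = np.zeros(d)   # running sum of exp(k_i) * v_i, exponentially decayed
    den = np.zeros(d)   # running sum of exp(k_i), exponentially decayed
    decay = np.exp(-w)
    for t in range(N):
        num = decay * num + np.exp(k[t]) * v[t]
        den = decay * den + np.exp(k[t])
        out[t] = num / den
    return out

def bidirectional_wkv(k, v, w=0.5):
    # A crude stand-in for a bidirectional recurrence: average a forward and a
    # backward pass so every position aggregates context from both directions.
    fwd = linear_wkv(k, v, w)
    bwd = linear_wkv(k[::-1], v[::-1], w)[::-1]
    return 0.5 * (fwd + bwd)
```

Because the recurrence only carries fixed-size accumulators between steps, its memory footprint is O(d) per channel regardless of sequence length, which is consistent with the memory savings the abstract reports for long videos.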
Problem

Research questions and friction points this paper is trying to address.

Reduces quadratic complexity in transformer video editing models
Enables efficient processing of long-duration and high-resolution videos
Maintains quality while achieving linear computational complexity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Linear spatio-temporal aggregation module reduces complexity
Bidirectional weighted key-value recurrence captures global dependencies
Achieves linear complexity while maintaining video editing quality
Abdelilah Aitrouga
International Artificial Intelligence Center of Morocco, University Mohammed VI Polytechnic, Rabat, Morocco
Youssef Hmamouche
International Artificial Intelligence Center of Morocco, University Mohammed VI Polytechnic, Rabat, Morocco
Amal El Fallah Seghrouchni
Full professor
Artificial Intelligence, Autonomous agents, Multi-Agent Systems, Ambient intelligence