FrameDiT: Diffusion Transformer with Frame-Level Matrix Attention for Efficient Video Generation

📅 2026-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video generation methods struggle to balance global spatio-temporal modeling with computational efficiency. This work proposes a frame-level matrix attention mechanism that treats entire video frames as native matrices within a diffusion Transformer (DiT) architecture, so that attention is computed across frames rather than across tokens, overcoming the limitations of conventional token-level attention. The approach captures large-scale motion while preserving fine-grained local detail, and further integrates Local Factorized Attention to enable multi-scale motion modeling. Experiments show the proposed method achieves state-of-the-art results across multiple video generation benchmarks, significantly improving temporal coherence and visual quality while maintaining favorable computational efficiency.

📝 Abstract
High-fidelity video generation remains challenging for diffusion models due to the difficulty of modeling complex spatio-temporal dynamics efficiently. Recent video diffusion methods typically represent a video as a sequence of spatio-temporal tokens which can be modeled using Diffusion Transformers (DiTs). However, this approach faces a trade-off between the strong but expensive Full 3D Attention and the efficient but temporally limited Local Factorized Attention. To resolve this trade-off, we propose Matrix Attention, a frame-level temporal attention mechanism that processes an entire frame as a matrix and generates query, key, and value matrices via matrix-native operations. By attending across frames rather than tokens, Matrix Attention effectively preserves global spatio-temporal structure and adapts to significant motion. We build FrameDiT-G, a DiT architecture based on Matrix Attention, and further introduce FrameDiT-H, which integrates Matrix Attention with Local Factorized Attention to capture both large and small motion. Extensive experiments show that FrameDiT-H achieves state-of-the-art results across multiple video generation benchmarks, offering improved temporal coherence and video quality while maintaining efficiency comparable to Local Factorized Attention.
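The abstract describes attending across frames rather than tokens, with each frame kept as a matrix and Q/K/V produced by matrix-native operations. The sketch below is one plausible reading of that idea, not the paper's actual implementation: per-frame Q/K/V matrices are formed by right-multiplying each frame with (hypothetical) projection matrices, frame-to-frame scores are Frobenius inner products, and the resulting T×T attention mixes whole value matrices. All names and shapes here are assumptions for illustration.

```python
import numpy as np

def frame_matrix_attention(frames, Wq, Wk, Wv):
    """Hypothetical sketch of frame-level 'Matrix Attention'.

    frames: (T, H, W) array -- T frames, each an H x W matrix.
    Wq, Wk, Wv: (W, D) projection matrices; right-multiplication keeps
    each frame a matrix instead of flattening it into spatial tokens.
    """
    # Per-frame query/key/value MATRICES via matrix-native projection.
    Q = frames @ Wq          # (T, H, D)
    K = frames @ Wk          # (T, H, D)
    V = frames @ Wv          # (T, H, D)

    # Frame-to-frame similarity via the Frobenius inner product <Q_i, K_j>,
    # giving a T x T score matrix instead of a (T*H*W)^2 token-level one.
    scores = np.einsum('ihd,jhd->ij', Q, K) / np.sqrt(Q[0].size)

    # Softmax over source frames, then mix the value matrices.
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return np.einsum('ij,jhd->ihd', weights, V)   # (T, H, D)

rng = np.random.default_rng(0)
frames = rng.standard_normal((4, 8, 8))           # 4 frames of 8x8
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = frame_matrix_attention(frames, Wq, Wk, Wv)
print(out.shape)  # (4, 8, 8)
```

Under this reading, the attention matrix is T×T (frames) rather than (T·H·W)×(T·H·W) (tokens), which is consistent with the claimed efficiency relative to Full 3D Attention while still letting every frame attend globally to every other frame.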
Problem

Research questions and friction points this paper is trying to address.

video generation
diffusion models
spatio-temporal dynamics
temporal coherence
high-fidelity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Matrix Attention
Frame-Level Attention
Diffusion Transformer
Video Generation
Spatio-Temporal Modeling