🤖 AI Summary
Current dynamic medical video generation methods struggle to jointly ensure spatial consistency and temporal dynamics, while Transformer-based architectures face bottlenecks including weak inter-channel interaction, the high computational complexity of self-attention, and coarse-grained noise-level adaptation. To address these challenges, we propose FEAT, a novel framework built on a full-dimensional sequential attention paradigm across space, time, and channels. FEAT introduces linear-complexity weighted key-value aggregation and global channel-wise attention to improve modeling efficiency and cross-channel coordination, together with a residual value-guided module that enables fine-grained, pixel-level noise conditioning. Experiments show that the lightweight variant FEAT-S matches or surpasses Endora using only 23% of its parameters, while the large-scale variant FEAT-L consistently outperforms state-of-the-art methods across multiple medical video benchmarks, exhibiting strong generalization, scalability, and inference efficiency.
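To make the full-dimensional attention paradigm concrete, the sketch below implements one sequential spatial, temporal, and channel attention block in PyTorch. The linear attention uses the standard softmax key/value factorization as one plausible reading of "weighted key-value aggregation", and the channel branch builds a C-by-C affinity whose cost is linear in the token count; all class names, tensor layouts, and head counts are illustrative assumptions, not the released FEAT implementation.

```python
import torch
import torch.nn as nn


class LinearAttention(nn.Module):
    """Linear-complexity attention over one axis (space or time).

    Keys are softmax-weighted over the token axis and pooled with the
    values into a small global summary that queries then read out,
    costing O(N * d^2) instead of the O(N^2 * d) of full self-attention.
    """

    def __init__(self, dim, heads=4):
        super().__init__()
        self.heads = heads
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (B, N, C)
        b, n, c = x.shape
        h, d = self.heads, c // self.heads
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        q, k, v = (t.reshape(b, n, h, d).transpose(1, 2) for t in (q, k, v))
        q = q.softmax(dim=-1)           # normalize queries over features
        k = k.softmax(dim=-2)           # weight keys over tokens
        ctx = k.transpose(-2, -1) @ v   # (B, h, d, d) weighted key-value summary
        out = (q @ ctx).transpose(1, 2).reshape(b, n, c)
        return self.proj(out)


class ChannelAttention(nn.Module):
    """Global attention across channels: a C x C affinity, linear in token count."""

    def __init__(self, dim):
        super().__init__()
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (B, N, C)
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        attn = (q.transpose(-2, -1) @ k).softmax(dim=-1)      # (B, C, C)
        out = (attn @ v.transpose(-2, -1)).transpose(-2, -1)  # (B, N, C)
        return self.proj(out)


class FEATBlock(nn.Module):
    """One sequential spatial -> temporal -> channel attention block."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm_s, self.norm_t, self.norm_c = (nn.LayerNorm(dim) for _ in range(3))
        self.spatial = LinearAttention(dim, heads)
        self.temporal = LinearAttention(dim, heads)
        self.channel = ChannelAttention(dim)

    def forward(self, x):  # x: (B, T, S, C) video tokens, S = H * W per frame
        b, t, s, c = x.shape
        xs = x.reshape(b * t, s, c)                    # attend within each frame
        x = (xs + self.spatial(self.norm_s(xs))).reshape(b, t, s, c)
        xt = x.transpose(1, 2).reshape(b * s, t, c)    # attend across frames per location
        x = (xt + self.temporal(self.norm_t(xt))).reshape(b, s, t, c).transpose(1, 2)
        xc = x.reshape(b, t * s, c)                    # attend globally across channels
        return (xc + self.channel(self.norm_c(xc))).reshape(b, t, s, c)


x = torch.randn(2, 8, 16 * 16, 64)  # (batch, frames, tokens per frame, channels)
y = FEATBlock(dim=64)(x)            # output keeps the same shape
```

Because each sub-attention is linear in its own token count, stacking the three stages keeps the whole block linear in the number of spatio-temporal tokens while still giving every token a global receptive field in all three dimensions.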
📝 Abstract
Synthesizing high-quality dynamic medical videos remains a significant challenge due to the need for modeling both spatial consistency and temporal dynamics. Existing Transformer-based approaches face critical limitations, including insufficient channel interactions, high computational complexity from self-attention, and coarse denoising guidance from timestep embeddings when handling varying noise levels. In this work, we propose FEAT, a full-dimensional efficient attention Transformer, which addresses these issues through three key innovations: (1) a unified paradigm with sequential spatial-temporal-channel attention mechanisms to capture global dependencies across all dimensions, (2) a linear-complexity design for attention mechanisms in each dimension, utilizing weighted key-value attention and global channel attention, and (3) a residual value guidance module that provides fine-grained pixel-level guidance to adapt to different noise levels. We evaluate FEAT on standard benchmarks and downstream tasks, demonstrating that FEAT-S, with only 23% of the parameters of the state-of-the-art model Endora, achieves comparable or even superior performance. Furthermore, FEAT-L surpasses all comparison methods across multiple datasets, showcasing both superior effectiveness and scalability. Code is available at https://github.com/Yaziwel/FEAT.
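As a rough illustration of innovation (3), the hedged sketch below contrasts the usual per-sample timestep embedding with a per-token residual derived from value features. The module name, fusion MLP, and shapes are assumptions for exposition and the paper's exact formulation may differ; see the repository above for the authors' code.

```python
import torch
import torch.nn as nn


class ResidualValueGuidance(nn.Module):
    """Sketch of fine-grained noise conditioning.

    A plain diffusion Transformer injects one timestep embedding per
    sample, so every pixel receives the same denoising guidance. Here the
    embedding is broadcast to all tokens and fused with per-token value
    features to produce a per-pixel residual; this fusion is a
    hypothetical stand-in for FEAT's residual value guidance module,
    not the published design.
    """

    def __init__(self, dim, t_dim):
        super().__init__()
        self.to_value = nn.Linear(dim, dim, bias=False)
        self.fuse = nn.Sequential(
            nn.Linear(dim + t_dim, dim),
            nn.GELU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x, t_emb):  # x: (B, N, C) tokens, t_emb: (B, t_dim)
        v = self.to_value(x)                               # per-token values
        t = t_emb.unsqueeze(1).expand(-1, x.shape[1], -1)  # broadcast to tokens
        guide = self.fuse(torch.cat([v, t], dim=-1))       # (B, N, C) guidance map
        return x + guide                                   # residual injection


x = torch.randn(2, 1024, 64)  # 1024 spatio-temporal tokens, 64 channels
t_emb = torch.randn(2, 128)   # diffusion timestep embedding
out = ResidualValueGuidance(dim=64, t_dim=128)(x, t_emb)
```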