🤖 AI Summary
Current dynamic medical video generation methods struggle to jointly ensure spatial consistency and temporal dynamics, while Transformer-based architectures face bottlenecks including weak inter-channel interaction, the high computational complexity of self-attention, and coarse-grained noise-level adaptation. To address these challenges, we propose FEAT, a novel framework built on a full-dimensional sequential attention paradigm across space, time, and channels. FEAT introduces linear-complexity weighted key-value aggregation and global channel-wise attention to improve modeling efficiency and cross-channel coordination, together with a residual value-guided module that enables fine-grained, pixel-level noise conditioning. Experiments show that the lightweight variant FEAT-S matches or surpasses Endora using only 23% of its parameters, while the large-scale variant FEAT-L consistently outperforms state-of-the-art methods across multiple medical video benchmarks, exhibiting strong generalization, scalability, and inference efficiency.
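To make the full-dimensional attention paradigm concrete, the sketch below implements one sequential spatial, temporal, and channel attention block in PyTorch. The linear attention uses the standard softmax key/value factorization as one plausible reading of "weighted key-value aggregation", and the channel branch builds a C-by-C affinity whose cost is linear in the token count; all class names, tensor layouts, and head counts are illustrative assumptions, not the released FEAT implementation.

```python
import torch
import torch.nn as nn


class LinearAttention(nn.Module):
    """Linear-complexity attention over one axis (space or time).

    Keys are softmax-weighted over the token axis and pooled with the
    values into a small global summary that queries then read out,
    costing O(N * d^2) instead of the O(N^2 * d) of full self-attention.
    """

    def __init__(self, dim, heads=4):
        super().__init__()
        self.heads = heads
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (B, N, C)
        b, n, c = x.shape
        h, d = self.heads, c // self.heads
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        q, k, v = (t.reshape(b, n, h, d).transpose(1, 2) for t in (q, k, v))
        q = q.softmax(dim=-1)           # normalize queries over features
        k = k.softmax(dim=-2)           # weight keys over tokens
        ctx = k.transpose(-2, -1) @ v   # (B, h, d, d) weighted key-value summary
        out = (q @ ctx).transpose(1, 2).reshape(b, n, c)
        return self.proj(out)


class ChannelAttention(nn.Module):
    """Global attention across channels: a C x C affinity, linear in token count."""

    def __init__(self, dim):
        super().__init__()
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (B, N, C)
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        attn = (q.transpose(-2, -1) @ k).softmax(dim=-1)      # (B, C, C)
        out = (attn @ v.transpose(-2, -1)).transpose(-2, -1)  # (B, N, C)
        return self.proj(out)


class FEATBlock(nn.Module):
    """One sequential spatial -> temporal -> channel attention block."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm_s, self.norm_t, self.norm_c = (nn.LayerNorm(dim) for _ in range(3))
        self.spatial = LinearAttention(dim, heads)
        self.temporal = LinearAttention(dim, heads)
        self.channel = ChannelAttention(dim)

    def forward(self, x):  # x: (B, T, S, C) video tokens, S = H * W per frame
        b, t, s, c = x.shape
        xs = x.reshape(b * t, s, c)                    # attend within each frame
        x = (xs + self.spatial(self.norm_s(xs))).reshape(b, t, s, c)
        xt = x.transpose(1, 2).reshape(b * s, t, c)    # attend across frames per location
        x = (xt + self.temporal(self.norm_t(xt))).reshape(b, s, t, c).transpose(1, 2)
        xc = x.reshape(b, t * s, c)                    # attend globally across channels
        return (xc + self.channel(self.norm_c(xc))).reshape(b, t, s, c)


x = torch.randn(2, 8, 16 * 16, 64)  # (batch, frames, tokens per frame, channels)
y = FEATBlock(dim=64)(x)            # output keeps the same shape
```

Because each sub-attention is linear in its own token count, stacking the three stages keeps the whole block linear in the number of spatio-temporal tokens while still giving every token a global receptive field in all three dimensions.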
📝 Abstract
Synthesizing high-quality dynamic medical videos remains a significant challenge due to the need for modeling both spatial consistency and temporal dynamics. Existing Transformer-based approaches face critical limitations, including insufficient channel interactions, high computational complexity from self-attention, and coarse denoising guidance from timestep embeddings when handling varying noise levels. In this work, we propose FEAT, a full-dimensional efficient attention Transformer, which addresses these issues through three key innovations: (1) a unified paradigm with sequential spatial-temporal-channel attention mechanisms to capture global dependencies across all dimensions, (2) a linear-complexity design for attention mechanisms in each dimension, utilizing weighted key-value attention and global channel attention, and (3) a residual value guidance module that provides fine-grained pixel-level guidance to adapt to different noise levels. We evaluate FEAT on standard benchmarks and downstream tasks, demonstrating that FEAT-S, with only 23% of the parameters of the state-of-the-art model Endora, achieves comparable or even superior performance. Furthermore, FEAT-L surpasses all comparison methods across multiple datasets, showcasing both superior effectiveness and scalability. Code is available at https://github.com/Yaziwel/FEAT.
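As a rough illustration of innovation (3), the hedged sketch below contrasts the usual per-sample timestep embedding with a per-token residual derived from value features. The module name, fusion MLP, and shapes are assumptions for exposition and the paper's exact formulation may differ; see the repository above for the authors' code.

```python
import torch
import torch.nn as nn


class ResidualValueGuidance(nn.Module):
    """Sketch of fine-grained noise conditioning.

    A plain diffusion Transformer injects one timestep embedding per
    sample, so every pixel receives the same denoising guidance. Here the
    embedding is broadcast to all tokens and fused with per-token value
    features to produce a per-pixel residual; this fusion is a
    hypothetical stand-in for FEAT's residual value guidance module,
    not the published design.
    """

    def __init__(self, dim, t_dim):
        super().__init__()
        self.to_value = nn.Linear(dim, dim, bias=False)
        self.fuse = nn.Sequential(
            nn.Linear(dim + t_dim, dim),
            nn.GELU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x, t_emb):  # x: (B, N, C) tokens, t_emb: (B, t_dim)
        v = self.to_value(x)                               # per-token values
        t = t_emb.unsqueeze(1).expand(-1, x.shape[1], -1)  # broadcast to tokens
        guide = self.fuse(torch.cat([v, t], dim=-1))       # (B, N, C) guidance map
        return x + guide                                   # residual injection


x = torch.randn(2, 1024, 64)  # 1024 spatio-temporal tokens, 64 channels
t_emb = torch.randn(2, 128)   # diffusion timestep embedding
out = ResidualValueGuidance(dim=64, t_dim=128)(x, t_emb)
```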