FEAT: Full-Dimensional Efficient Attention Transformer for Medical Video Generation

πŸ“… 2025-06-05
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Current dynamic medical video generation methods struggle to simultaneously ensure spatial consistency and temporal dynamics, while Transformer-based architectures face bottlenecks including weak inter-channel interaction, high computational complexity of self-attention, and coarse-grained noise-level adaptation. To address these challenges, we propose FEATβ€”a novel framework featuring a full-dimensional sequential attention paradigm across space, time, and channels. FEAT introduces linear-complexity weighted key-value aggregation and global channel-wise attention to enhance modeling efficiency and cross-channel coordination. Additionally, a residual value-guided module enables pixel-level fine-grained noise conditioning. Experiments demonstrate that the lightweight variant FEAT-S achieves comparable or superior performance to Endora using only 23% of its parameters. The large-scale variant FEAT-L consistently outperforms state-of-the-art methods across multiple medical video benchmarks, exhibiting strong generalization, scalability, and inference efficiency.

πŸ“ Abstract
Synthesizing high-quality dynamic medical videos remains a significant challenge due to the need for modeling both spatial consistency and temporal dynamics. Existing Transformer-based approaches face critical limitations, including insufficient channel interactions, high computational complexity from self-attention, and coarse denoising guidance from timestep embeddings when handling varying noise levels. In this work, we propose FEAT, a full-dimensional efficient attention Transformer, which addresses these issues through three key innovations: (1) a unified paradigm with sequential spatial-temporal-channel attention mechanisms to capture global dependencies across all dimensions, (2) a linear-complexity design for attention mechanisms in each dimension, utilizing weighted key-value attention and global channel attention, and (3) a residual value guidance module that provides fine-grained pixel-level guidance to adapt to different noise levels. We evaluate FEAT on standard benchmarks and downstream tasks, demonstrating that FEAT-S, with only 23% of the parameters of the state-of-the-art model Endora, achieves comparable or even superior performance. Furthermore, FEAT-L surpasses all comparison methods across multiple datasets, showcasing both superior effectiveness and scalability. Code is available at https://github.com/Yaziwel/FEAT.
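The abstract's second innovation, linear-complexity attention via weighted key-value aggregation, follows the general pattern of aggregating a global key-value summary once rather than forming an n-by-n attention matrix. The sketch below illustrates that pattern only; the feature map `phi` and all variable names are assumptions for illustration, not the paper's actual formulation.

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Sketch of linear-complexity attention (hypothetical; not FEAT's
    exact design). A positive feature map lets K^T V be aggregated once,
    giving O(n * d^2) cost instead of the O(n^2 * d) of softmax
    self-attention."""
    phi = lambda x: np.maximum(x, 0.0) + eps   # simple positive feature map (assumption)
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                # (d, d_v): global weighted key-value summary
    z = Qp @ Kp.sum(axis=0)      # (n,): per-query normalizer, always > 0 here
    return (Qp @ kv) / z[:, None]

rng = np.random.default_rng(0)
n, d = 8, 4                      # sequence length, feature dimension
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out = linear_attention(Q, K, V)
print(out.shape)                 # (8, 4)
```

Because the `(d, d_v)` summary `kv` is independent of sequence length, the same trick applies along the spatial, temporal, or channel axis, which is presumably what makes a full-dimensional sequential design affordable.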
Problem

Research questions and friction points this paper is trying to address.

Modeling spatial-temporal dynamics in medical video generation
Reducing computational complexity in Transformer attention mechanisms
Improving denoising guidance for varying noise levels
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sequential spatial-temporal-channel attention mechanisms
Linear-complexity design for attention mechanisms
Residual value guidance module for noise adaptation
Huihan Wang
School of Biological Science and Medical Engineering, State Key Laboratory of Software Development Environment, Key Laboratory of Biomechanics and Mechanobiology of Ministry of Education, Beijing Advanced Innovation Center for Biomedical Engineering, Beihang University, Beijing 100191, China
Zhiwen Yang
Beihang University
Low-level Vision · AIGC · Medical Image Analysis
Hui Zhang
Department of Biomedical Engineering, Tsinghua University, Beijing 100084, China
Dan Zhao
Department of Gynecology Oncology, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100021, China
Bingzheng Wei
ByteDance Inc., Beijing 100098, China
Yan Xu
School of Biological Science and Medical Engineering, State Key Laboratory of Software Development Environment, Key Laboratory of Biomechanics and Mechanobiology of Ministry of Education, Beijing Advanced Innovation Center for Biomedical Engineering, Beihang University, Beijing 100191, China