Leveraging Complementary Attention maps in vision transformers for OCT image analysis

📅 2023-10-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the challenge of automatically identifying retinal biomarkers linked to optical defects in OCT images, this paper proposes a two-model detection pipeline that draws on both local and global structural cues. It pairs MaxViT (a hybrid vision transformer that combines convolution layers with strided attention), which is better suited to local features, with EVA-02 (a standard, pure-attention vision transformer), which excels at global features, and ensembles their predictions. The ensemble took first place in the IEEE VIP Cup 2023 OCT biomarker detection track with a patient-wise F1 score of 0.8527, 3.8% higher than the runner-up. The authors then use knowledge distillation to train a single MaxViT that outperforms the ensemble at a fraction of the computation cost, with a reported 2.1× inference speedup and 64% fewer parameters. The approach balances diagnostic robustness with deployment efficiency, offering a lightweight option for clinical OCT-assisted interpretation.

📝 Abstract

An Optical Coherence Tomography (OCT) scan yields all possible cross-sectional images of a retina for detecting biomarkers linked to optical defects. Because of the high volume of data generated, an automated and reliable biomarker detection pipeline is necessary as a primary screening stage. We outline our new state-of-the-art pipeline for identifying biomarkers from OCT scans. In collaboration with trained ophthalmologists, we identify local and global structures in biomarkers. Through a comprehensive and systematic review of existing vision architectures, we evaluate different convolution and attention mechanisms for biomarker detection. We find that MaxViT, a hybrid vision transformer combining convolution layers with strided attention, is better suited for local feature detection, while EVA-02, a standard vision transformer leveraging pure attention and large-scale knowledge distillation, excels at capturing global features. We ensemble the predictions of both models and take first place in the IEEE Video and Image Processing Cup 2023 competition on OCT biomarker detection, with a patient-wise F1 score of 0.8527 in the final phase of the competition, 3.8% higher than the next best solution. Finally, we use knowledge distillation to train a single MaxViT that outperforms our ensemble at a fraction of the computation cost.
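
The abstract does not say how the two models' predictions are combined or exactly how the patient-wise F1 score is computed, so the following is only a minimal sketch under stated assumptions: each model emits one probability per biomarker for every OCT slice, the ensemble is a weighted average of those probabilities, a fixed 0.5 threshold turns probabilities into labels, and the patient-wise score is the mean of per-patient F1 values. All function names and the example numbers are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def ensemble_probs(probs_a, probs_b, weight_a=0.5):
    """Weighted average of two models' per-biomarker probabilities for one slice.

    Assumption: simple probability averaging; the paper may combine models differently.
    """
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(probs_a, probs_b)]


def binarize(probs, threshold=0.5):
    """Turn probabilities into 0/1 biomarker labels (threshold is an assumption)."""
    return [1 if p >= threshold else 0 for p in probs]


def f1_from_counts(tp, fp, fn):
    """F1 from true-positive, false-positive and false-negative counts."""
    denom = 2 * tp + fp + fn
    # Convention assumed here: F1 is 0.0 when there are no positives at all.
    return 2 * tp / denom if denom else 0.0


def patient_wise_f1(records):
    """Mean of per-patient F1 scores.

    records: iterable of (patient_id, true_labels, predicted_labels), one per slice.
    Assumption: counts are pooled over all slices and biomarkers of a patient.
    """
    counts = defaultdict(lambda: [0, 0, 0])  # patient_id -> [tp, fp, fn]
    for patient_id, y_true, y_pred in records:
        for t, p in zip(y_true, y_pred):
            if t == 1 and p == 1:
                counts[patient_id][0] += 1
            elif t == 0 and p == 1:
                counts[patient_id][1] += 1
            elif t == 1 and p == 0:
                counts[patient_id][2] += 1
    scores = [f1_from_counts(*c) for c in counts.values()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical probabilities for three biomarkers on a single slice.
local_model_probs = [0.9, 0.2, 0.6]   # stands in for the MaxViT output
global_model_probs = [0.7, 0.4, 0.3]  # stands in for the EVA-02 output
slice_labels = binarize(ensemble_probs(local_model_probs, global_model_probs))
print(slice_labels)  # [1, 0, 0]
```

Averaging probabilities rather than hard votes lets a confident model outweigh an uncertain one on a given biomarker, which is one plausible reason to ensemble a local-feature model with a global-feature model.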

Problem

Research questions and friction points this paper is trying to address.

Automated detection of retinal biomarkers in OCT scans

Combining local and global feature analysis for biomarker identification

Improving accuracy and efficiency in OCT image analysis pipelines

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid vision transformer combines convolution and attention

Ensemble of MaxViT and EVA-02 models

Knowledge distillation reduces computation cost significantly
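
The page gives no details of the distillation objective, so this is a hedged sketch of one common form of knowledge distillation for multi-label detection, not the authors' loss. It assumes the student (for example, a single MaxViT) outputs one logit per biomarker, the teacher supplies soft probabilities (using the ensemble as teacher is an assumption), and the loss mixes binary cross-entropy against the ground-truth labels with binary cross-entropy against the teacher's soft targets. The mixing weight `alpha` and all names are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def bce(prob, target, eps=1e-7):
    """Binary cross-entropy for one output; target may be a soft value in [0, 1]."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def distillation_loss(student_logits, teacher_probs, labels, alpha=0.5):
    """alpha * BCE(student, hard labels) + (1 - alpha) * BCE(student, teacher soft targets).

    Assumption: a plain soft-target formulation without temperature scaling.
    """
    n = len(student_logits)
    student_probs = [sigmoid(z) for z in student_logits]
    hard = sum(bce(p, y) for p, y in zip(student_probs, labels)) / n
    soft = sum(bce(p, t) for p, t in zip(student_probs, teacher_probs)) / n
    return alpha * hard + (1.0 - alpha) * soft


# Hypothetical example with two biomarkers on one slice.
labels = [1, 0]
teacher_probs = [0.9, 0.1]
agreeing_student = distillation_loss([2.0, -2.0], teacher_probs, labels)
disagreeing_student = distillation_loss([-2.0, 2.0], teacher_probs, labels)
print(agreeing_student < disagreeing_student)  # True
```

The soft-target term gives the student a graded signal (how confident the teacher is on each biomarker) in addition to the 0/1 labels, which is the usual motivation for distilling a larger model or ensemble into a single cheaper network.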
