🤖 AI Summary
To automate the detection of retinal biomarkers linked to optical defects in OCT images, this paper proposes a two-model detection pipeline that exploits both the local and the global structure of biomarkers. It pairs MaxViT (a hybrid vision transformer combining convolution layers with strided attention), which is better suited to local features, with EVA-02 (a pure-attention vision transformer), which excels at global features, and ensembles their predictions. The ensemble achieves a patient-wise F1 score of 0.8527 on the IEEE VIP Cup 2023 OCT biomarker detection track, 3.8% higher than the runner-up. Knowledge distillation then yields a single MaxViT that outperforms the ensemble while accelerating inference by 2.1× and reducing parameters by 64%. The approach balances diagnostic robustness with deployment efficiency, offering a lightweight solution for clinical OCT-assisted interpretation.
📝 Abstract
An Optical Coherence Tomography (OCT) scan yields cross-sectional images of the retina that are used to detect biomarkers linked to optical defects. Because of the high volume of data generated, an automated and reliable biomarker detection pipeline is necessary as a primary screening stage. We outline our new state-of-the-art pipeline for identifying biomarkers from OCT scans. In collaboration with trained ophthalmologists, we identify local and global structures in biomarkers. Through a comprehensive and systematic review of existing vision architectures, we evaluate different convolution and attention mechanisms for biomarker detection. We find that MaxViT, a hybrid vision transformer combining convolution layers with strided attention, is better suited for local feature detection, while EVA-02, a standard vision transformer leveraging pure attention and large-scale knowledge distillation, excels at capturing global features. We ensemble the predictions of both models and take first place in the IEEE Video and Image Processing Cup 2023 competition on OCT biomarker detection, with a patient-wise F1 score of 0.8527 in the final phase of the competition, 3.8% higher than the next best solution. Finally, we use knowledge distillation to train a single MaxViT that outperforms our ensemble at a fraction of the computation cost.
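The ensembling and patient-wise scoring mentioned in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption, not the authors' implementation: the abstract does not say how the two models' predictions are combined or exactly how the patient-wise F1 is computed. The sketch assumes a weighted average of per-biomarker probabilities, a fixed 0.5 threshold, and an F1 score computed per patient over all of that patient's image labels and then averaged across patients. The function names (`ensemble_probs`, `binarize`, `patient_wise_f1`) are hypothetical.

```python
from collections import defaultdict


def ensemble_probs(probs_a, probs_b, weight_a=0.5):
    """Weighted average of two models' per-biomarker probabilities (assumed scheme)."""
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(probs_a, probs_b)]


def binarize(probs, threshold=0.5):
    """Turn probabilities into 0/1 biomarker predictions (threshold is an assumption)."""
    return [1 if p >= threshold else 0 for p in probs]


def f1_score(y_true, y_pred):
    """Binary F1 over flat 0/1 lists; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def patient_wise_f1(records):
    """Average of per-patient F1 scores (assumed definition of 'patient-wise').

    records: iterable of (patient_id, true_labels, predicted_labels), one per image,
    where the label lists hold one 0/1 entry per biomarker.
    """
    true_by_patient = defaultdict(list)
    pred_by_patient = defaultdict(list)
    for patient_id, y_true, y_pred in records:
        # Pool every biomarker label from every image of the same patient.
        true_by_patient[patient_id].extend(y_true)
        pred_by_patient[patient_id].extend(y_pred)
    scores = [
        f1_score(true_by_patient[pid], pred_by_patient[pid])
        for pid in true_by_patient
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

In use, the probabilities from the two models for one image would be passed through `ensemble_probs` and `binarize`, and the resulting predictions collected per patient before calling `patient_wise_f1`. Equal weighting is only a starting point; the weight and threshold would normally be tuned on held-out patients.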