🤖 AI Summary
This work addresses multi-label fundus image diagnosis, which requires modeling fine-grained lesions and large-scale retinal structure at the same time. Many multi-scale approaches rely on explicit frequency-domain decomposition, which the authors' ablations find adds parameters and computation while providing limited benefit in this setting. The authors instead propose Clifford-M, a lightweight backbone that avoids handcrafted frequency engineering and uses a Clifford-style rolling product to jointly model alignment and structural variation with linear complexity. Embedded in a compact dual-resolution architecture, this design enables efficient cross-scale fusion and self-refinement. With only 0.85 million parameters and no pre-training, Clifford-M achieves a mean AUC-ROC of 0.8142 and a mean macro-F1 of 0.5481 on ODIR-5K, outperforming substantially larger CNN baselines trained under the same protocol. Evaluated on RFMiD without fine-tuning, it attains a macro AUC of 0.7425 and a micro AUC of 0.7610.
📝 Abstract
Multi-label fundus diagnosis requires features that capture both fine-grained lesions and large-scale retinal structure. Many multi-scale medical vision models address this challenge through explicit frequency decomposition, but our ablation studies show that such heuristics provide limited benefit in this setting: replacing the proposed simple dual-resolution stem with Octave Convolution increased parameters by 35% and computation 2.23-fold without improving mean accuracy, while a fixed wavelet-based variant performed substantially worse.
Motivated by these findings, we propose Clifford-M, a lightweight backbone that replaces both feed-forward expansion and frequency-splitting modules with sparse geometric interaction. The model is built on a Clifford-style rolling product that jointly captures alignment and structural variation with linear complexity, enabling efficient cross-scale fusion and self-refinement in a compact dual-resolution architecture. Without pre-training, Clifford-M achieves a mean AUC-ROC of 0.8142 and a mean macro-F1 (optimal threshold) of 0.5481 on ODIR-5K using only 0.85M parameters, outperforming substantially larger mid-scale CNN baselines under the same training protocol. When evaluated on RFMiD without fine-tuning, it attains 0.7425 +/- 0.0198 macro AUC and 0.7610 +/- 0.0344 micro AUC, indicating reasonable robustness to cross-dataset shift.
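The abstract does not define the rolling product, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that "rolling" means cyclically shifting the channel axis, that the alignment term is a symmetric (inner-product-like) elementwise product, and that the structural-variation term is an antisymmetric (wedge-like) difference. The function name `rolling_product`, the `shifts` parameter, and the choice to concatenate the terms are all hypothetical.

```python
import numpy as np

def rolling_product(u, v, shifts=(1, 2)):
    """Illustrative Clifford-style rolling product (assumed form, not the paper's).

    u, v: feature maps of shape (C, H, W).
    Returns an array of shape (2 * len(shifts) * C, H, W).
    """
    terms = []
    for s in shifts:
        # Cyclic shift along the channel axis pairs channel c with channel c - s.
        u_s = np.roll(u, s, axis=0)
        v_s = np.roll(v, s, axis=0)
        # Symmetric, inner-product-like term: large where the shifted pair agrees.
        inner = u * v_s
        # Antisymmetric, wedge-like term: swapping u and v flips its sign,
        # and it vanishes when u == v.
        wedge = u * v_s - u_s * v
        terms.extend([inner, wedge])
    # Each shift costs O(C * H * W) elementwise work, so the total cost is
    # linear in the number of pixels and channels for a fixed set of shifts.
    return np.concatenate(terms, axis=0)

rng = np.random.default_rng(0)
u = rng.standard_normal((8, 4, 4))
v = rng.standard_normal((8, 4, 4))
print(rolling_product(u, v).shape)
```

Under these assumptions the operation uses only a fixed, small set of shifts, which is where the sparsity and the linear cost would come from; a dense channel-by-channel interaction would instead scale quadratically in the channel count.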
These results suggest that competitive and efficient fundus diagnosis can be achieved without explicit frequency engineering, provided that the core feature interaction is designed to capture multi-scale structure directly.