🤖 AI Summary
Tongue image diagnosis faces four challenges: fine-grained vision–semantics modeling, scarce expert annotations, severe class imbalance, and insufficient clinical interpretability. Method: We propose MIRNet, a multimodal interpretable framework featuring (i) masked autoencoder (MAE)-based self-supervised pretraining for robust feature learning; (ii) a clinician-constructed constraint graph integrated with graph attention networks (GAT) and KL-divergence regularization to model label correlations and embed clinical priors; and (iii) an asymmetric loss (ASL) with dedicated regularization to mitigate class imbalance. We further introduce TongueAtlas-4K, a large-scale tongue diagnosis dataset comprising 4,000 high-quality expert-annotated images. Results: MIRNet achieves state-of-the-art performance on multi-label tongue diagnosis while improving model interpretability, cross-scenario generalization, and clinical plausibility. The framework is designed to transfer to other medical imaging diagnostic tasks.
📝 Abstract
Automated interpretation of medical images demands robust modeling of complex visual-semantic relationships while addressing annotation scarcity, label imbalance, and clinical plausibility constraints. We introduce MIRNet (Medical Image Reasoner Network), a novel framework that integrates self-supervised pre-training with constrained graph-based reasoning. We focus on tongue image diagnosis, a particularly challenging domain that requires fine-grained visual and semantic understanding. Our approach leverages a self-supervised masked autoencoder (MAE) to learn transferable visual representations from unlabeled data; employs graph attention networks (GAT) to model label correlations through expert-defined structured graphs; enforces clinical priors via constraint-aware optimization using KL divergence and regularization losses; and mitigates imbalance using asymmetric loss (ASL) and boosting ensembles. To address annotation scarcity, we also introduce TongueAtlas-4K, a comprehensive expert-curated benchmark comprising 4,000 images annotated with 22 diagnostic labels, the largest public dataset in tongue analysis. Validation shows that our method achieves state-of-the-art performance. While optimized for tongue diagnosis, the framework readily generalizes to broader diagnostic medical imaging tasks.
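To make the imbalance-handling component concrete, the following is a minimal sketch of asymmetric loss for a single multi-label sample, written in the form commonly given for ASL: separate focusing exponents for positive and negative labels, plus a probability margin that zeroes out very easy negatives. The function name `asymmetric_loss` and the default values for `gamma_pos`, `gamma_neg`, and `margin` are illustrative assumptions; the abstract does not state MIRNet's hyperparameters or its exact formulation.

```python
import math


def asymmetric_loss(probs, targets, gamma_pos=0.0, gamma_neg=4.0,
                    margin=0.05, eps=1e-8):
    """Mean asymmetric loss over the labels of one sample.

    probs   : predicted probabilities in [0, 1], one per label
    targets : 0/1 ground-truth indicators, one per label
    All default hyperparameters are illustrative assumptions.
    """
    total = 0.0
    for p, y in zip(probs, targets):
        if y == 1:
            # Positive label: focal-style weighting with exponent gamma_pos.
            total += -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
        else:
            # Negative label: shift the probability down by the margin so
            # that very easy negatives (p <= margin) contribute nothing,
            # then down-weight the remainder with exponent gamma_neg.
            p_m = max(p - margin, 0.0)
            total += -(p_m ** gamma_neg) * math.log(max(1.0 - p_m, eps))
    return total / len(probs)


# Example: one sample with three labels (values are made up for illustration).
loss = asymmetric_loss([0.9, 0.2, 0.03], [1, 0, 0])
```

With `gamma_pos = gamma_neg = 0` and `margin = 0`, the expression reduces to ordinary binary cross-entropy. Raising `gamma_neg` above `gamma_pos` shrinks the contribution of the many easy negative labels, which is why this family of losses suits datasets where most labels are absent from most images.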