🤖 AI Summary
To address semantic ambiguity and limited clinical interpretability in medical vision-language models such as MedCLIP, this paper proposes the Medical Sparse Autoencoder (MedSAE), which enables fine-grained, neuron-level analysis of the latent space of MedCLIP, a model trained to align chest X-rays with radiology reports. Methodologically, MedSAE combines sparse coding, multimodal representation learning, and automated neuron naming guided by the MedGEMMA large language model, underpinned by a quantitative evaluation framework covering three dimensions: neuron–concept correlation, activation entropy, and semantic consistency. Experiments on CheXpert show that MedSAE substantially improves neuron monosemanticity (+32.7%) and clinical interpretability (expert interpretability scores up 41.5%). It is the first approach to achieve a verifiable, nameable, and traceable decomposition of MedCLIP's high-level representations, establishing a new paradigm for trustworthy, interpretable medical AI.
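As a rough illustration of the core mechanism, the sketch below trains a sparse autoencoder over frozen MedCLIP image embeddings. All names and hyperparameters here (`SparseAutoencoder`, `d_model=512`, `d_hidden=4096`, `l1_coef`) are illustrative assumptions, not the paper's reported architecture or configuration.

```python
# Minimal sparse-autoencoder sketch over frozen MedCLIP image embeddings.
# Dimensions and the L1 coefficient are illustrative assumptions.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 512, d_hidden: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        # ReLU keeps hidden codes non-negative, which pairs naturally
        # with an L1 sparsity penalty.
        z = torch.relu(self.encoder(x))
        x_hat = self.decoder(z)
        return x_hat, z

def sae_loss(x, x_hat, z, l1_coef: float = 1e-3):
    # Reconstruction error plus an L1 penalty on the codes: the standard
    # dictionary-learning-style recipe for sparse autoencoders.
    recon = (x - x_hat).pow(2).mean()
    sparsity = z.abs().mean()
    return recon + l1_coef * sparsity

# Usage: in practice x would be embeddings from a frozen MedCLIP encoder.
x = torch.randn(32, 512)  # stand-in for a batch of MedCLIP embeddings
sae = SparseAutoencoder()
x_hat, z = sae(x)
loss = sae_loss(x, x_hat, z)
loss.backward()
```

Each hidden unit of such an autoencoder is a candidate "neuron" whose activations can then be correlated with clinical concepts and named automatically.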
📝 Abstract
Artificial intelligence in healthcare requires models that are both accurate and interpretable. We advance mechanistic interpretability in medical vision by applying Medical Sparse Autoencoders (MedSAEs) to the latent space of MedCLIP, a vision-language model trained on chest radiographs and reports. To quantify interpretability, we propose an evaluation framework that combines correlation metrics, entropy analyses, and automated neuron naming via the MedGEMMA foundation model. Experiments on the CheXpert dataset show that MedSAE neurons achieve higher monosemanticity and interpretability than raw MedCLIP features. Our findings bridge high-performing medical AI and transparency, offering a scalable step toward clinically reliable representations.
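To make the evaluation framework concrete, here is a minimal sketch of the two quantitative checks named above: neuron–concept correlation and activation entropy. It assumes correlation is computed point-biserially against binary CheXpert concept labels and entropy over a histogram of activations; the function names and estimator choices are ours, and the paper's exact formulations may differ.

```python
# Hedged sketch of two interpretability metrics: neuron-concept
# correlation and activation entropy. Estimator choices are assumptions.
import numpy as np
from scipy.stats import pointbiserialr

def neuron_concept_correlation(activations: np.ndarray,
                               labels: np.ndarray) -> float:
    # activations: (n_samples,) activations of a single SAE neuron
    # labels: (n_samples,) binary CheXpert concept labels (e.g. "Edema")
    r, _ = pointbiserialr(labels, activations)
    return r

def activation_entropy(activations: np.ndarray, bins: int = 32) -> float:
    # Shannon entropy of the neuron's activation histogram; lower entropy
    # suggests a more selective, and plausibly more monosemantic, neuron.
    hist, _ = np.histogram(activations, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Usage with stand-in data (real inputs would be SAE neuron activations
# and CheXpert labels for the same images):
acts = np.random.rand(1000)
labels = (np.random.rand(1000) > 0.9).astype(int)
print(neuron_concept_correlation(acts, labels), activation_entropy(acts))
```

Under this reading, a neuron that correlates strongly with one concept while keeping low activation entropy would score as both interpretable and monosemantic.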