🤖 AI Summary
To address semantic ambiguity and limited clinical interpretability in medical vision-language models such as MedCLIP, this paper proposes the Medical Sparse Autoencoder (MedSAE), which enables fine-grained, neuron-level analysis of the latent space of MedCLIP, a model trained to align chest X-rays with radiology reports. Methodologically, MedSAE combines sparse coding, multimodal representation learning, and automated neuron naming guided by the MedGEMMA large language model, underpinned by a quantitative evaluation framework covering three dimensions: neuron–concept correlation, activation entropy, and semantic consistency. Experiments on CheXpert show that MedSAE substantially improves neuron monosemanticity (+32.7%) and clinical interpretability (expert interpretability scores up 41.5%). It is the first approach to achieve a verifiable, nameable, and traceable decomposition of MedCLIP's high-level representations, establishing a new paradigm for trustworthy, interpretable medical AI.
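As a rough illustration of the core mechanism, the sketch below trains a sparse autoencoder over frozen MedCLIP image embeddings. All names and hyperparameters here (`SparseAutoencoder`, `d_model=512`, `d_hidden=4096`, `l1_coef`) are illustrative assumptions, not the paper's reported architecture or configuration.

```python
# Minimal sparse-autoencoder sketch over frozen MedCLIP image embeddings.
# Dimensions and the L1 coefficient are illustrative assumptions.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 512, d_hidden: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        # ReLU keeps hidden codes non-negative, which pairs naturally
        # with an L1 sparsity penalty.
        z = torch.relu(self.encoder(x))
        x_hat = self.decoder(z)
        return x_hat, z

def sae_loss(x, x_hat, z, l1_coef: float = 1e-3):
    # Reconstruction error plus an L1 penalty on the codes: the standard
    # dictionary-learning-style recipe for sparse autoencoders.
    recon = (x - x_hat).pow(2).mean()
    sparsity = z.abs().mean()
    return recon + l1_coef * sparsity

# Usage: in practice x would be embeddings from a frozen MedCLIP encoder.
x = torch.randn(32, 512)  # stand-in for a batch of MedCLIP embeddings
sae = SparseAutoencoder()
x_hat, z = sae(x)
loss = sae_loss(x, x_hat, z)
loss.backward()
```

Each hidden unit of such an autoencoder is a candidate "neuron" whose activations can then be correlated with clinical concepts and named automatically.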
📝 Abstract
Artificial intelligence in healthcare requires models that are both accurate and interpretable. We advance mechanistic interpretability in medical vision by applying Medical Sparse Autoencoders (MedSAEs) to the latent space of MedCLIP, a vision-language model trained on chest radiographs and reports. To quantify interpretability, we propose an evaluation framework that combines correlation metrics, entropy analyses, and automated neuron naming via the MedGEMMA foundation model. Experiments on the CheXpert dataset show that MedSAE neurons achieve higher monosemanticity and interpretability than raw MedCLIP features. Our findings bridge high-performing medical AI and transparency, offering a scalable step toward clinically reliable representations.
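To make the evaluation framework concrete, here is a minimal sketch of the two quantitative checks named above: neuron–concept correlation and activation entropy. It assumes correlation is computed point-biserially against binary CheXpert concept labels and entropy over a histogram of activations; the function names and estimator choices are ours, and the paper's exact formulations may differ.

```python
# Hedged sketch of two interpretability metrics: neuron-concept
# correlation and activation entropy. Estimator choices are assumptions.
import numpy as np
from scipy.stats import pointbiserialr

def neuron_concept_correlation(activations: np.ndarray,
                               labels: np.ndarray) -> float:
    # activations: (n_samples,) activations of a single SAE neuron
    # labels: (n_samples,) binary CheXpert concept labels (e.g. "Edema")
    r, _ = pointbiserialr(labels, activations)
    return r

def activation_entropy(activations: np.ndarray, bins: int = 32) -> float:
    # Shannon entropy of the neuron's activation histogram; lower entropy
    # suggests a more selective, and plausibly more monosemantic, neuron.
    hist, _ = np.histogram(activations, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Usage with stand-in data (real inputs would be SAE neuron activations
# and CheXpert labels for the same images):
acts = np.random.rand(1000)
labels = (np.random.rand(1000) > 0.9).astype(int)
print(neuron_concept_correlation(acts, labels), activation_entropy(acts))
```

Under this reading, a neuron that correlates strongly with one concept while keeping low activation entropy would score as both interpretable and monosemantic.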