🤖 AI Summary
This work addresses the limited interpretability of medical image segmentation models, which hinders error diagnosis and reduces robustness under data distribution shift. The authors propose the first latent-level model-diffing framework tailored for medical image segmentation, using sparse autoencoders to extract interpretable latent variables from the internal representations of SegFormer and U-Net. By systematically comparing representations across architectures and datasets, they show how shared and population-specific latent factors influence model performance. Building on these insights, they apply causal interventions without retraining, restoring segmentation accuracy in 70% of failure cases (raising the Dice score from 39.4% to 74.2%) and substantially improving cross-dataset generalization.
📝 Abstract
Modern segmentation models achieve strong predictive performance but remain largely opaque, limiting our ability to diagnose failures, understand dataset shift, or intervene in a principled manner. We introduce Med-SegLens, a model-diffing framework that decomposes segmentation model activations into interpretable latent features using sparse autoencoders trained on SegFormer and U-Net. Through cross-architecture and cross-dataset latent alignment across healthy, adult, pediatric, and sub-Saharan African glioma cohorts, we identify a stable backbone of shared representations, while dataset shift is driven by differential reliance on population-specific latents. We show that these latents act as causal bottlenecks for segmentation failures, and that targeted latent-level interventions can correct errors and improve cross-dataset adaptation without retraining, recovering performance in 70% of failure cases and improving Dice score from 39.4% to 74.2%. Our results demonstrate that latent-level model diffing provides a practical and mechanistic tool for diagnosing failures and mitigating dataset shift in segmentation models.
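To make the two central quantities in the abstract concrete, the following is a minimal sketch of a Dice score and of a latent-level intervention through a sparse autoencoder. It is a generic illustration, not the Med-SegLens implementation: the abstract does not specify the autoencoder architecture or the intervention procedure, so the ReLU encoder, the linear decoder, the choice to preserve the reconstruction residual, and all function names (`dice_score`, `sae_encode`, `sae_decode`, `intervene`) are assumptions made here.

```python
from typing import List, Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for two flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def sae_encode(x: Sequence[float], w_enc: Sequence[Sequence[float]],
               b_enc: Sequence[float]) -> List[float]:
    """Assumed encoder: z = ReLU(W_enc x + b_enc), one row of W_enc per latent."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(w_enc, b_enc)]


def sae_decode(z: Sequence[float], w_dec: Sequence[Sequence[float]],
               b_dec: Sequence[float]) -> List[float]:
    """Assumed decoder: x_hat = b_dec + sum_i z_i * d_i, one row of W_dec per latent."""
    out = list(b_dec)
    for zi, direction in zip(z, w_dec):
        for j, d in enumerate(direction):
            out[j] += zi * d
    return out


def intervene(x: Sequence[float], w_enc: Sequence[Sequence[float]],
              b_enc: Sequence[float], w_dec: Sequence[Sequence[float]],
              b_dec: Sequence[float], latent_index: int,
              new_value: float) -> List[float]:
    """Set one latent to new_value and return the edited activation.

    The difference between the edited and unedited reconstructions is added
    to the original activation, so the autoencoder's reconstruction error is
    left untouched and only the chosen latent's contribution changes.
    """
    z = sae_encode(x, w_enc, b_enc)
    recon = sae_decode(z, w_dec, b_dec)
    z_edit = list(z)
    z_edit[latent_index] = new_value
    recon_edit = sae_decode(z_edit, w_dec, b_dec)
    return [xi + (e - r) for xi, r, e in zip(x, recon, recon_edit)]
```

In a real pipeline the edited activation would be written back into the segmentation model at the layer where the autoencoder was trained, and the Dice score of the resulting mask compared before and after the edit. No model weights are updated at any point, which is the sense in which such an intervention works without retraining.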