🤖 AI Summary
This work addresses the instability and inconsistency of attribution-based explanations from large language models (LLMs) in clinical neuroscience, problems that often arise from representational polysemy. To mitigate this, the authors propose a unified framework that integrates attribution with mechanistic interpretability by constructing and optimizing a monosemantic embedding space at specific LLM layers. Explicitly disentangling semantic features in this way reduces variability across attribution methods and yields stable importance scores aligned with the model's input–output mapping. By combining monosemantic representation learning, attribution ensembling, and layer-wise feature optimization, the method improves the reliability and reproducibility of explanations in tasks such as Alzheimer's disease progression diagnosis, offering a trustworthy foundation for safely deploying LLMs in cognitive health applications.
📝 Abstract
Interpretability remains a key challenge for deploying large language models (LLMs) in clinical settings such as Alzheimer's disease progression diagnosis, where early and trustworthy predictions are essential. Existing attribution methods exhibit high inter-method variability and unstable explanations due to the polysemantic nature of LLM representations, while mechanistic interpretability approaches lack direct alignment with model inputs and outputs and do not provide explicit importance scores. We introduce a unified interpretability framework that integrates the attributional and mechanistic perspectives through monosemantic feature extraction. By constructing a monosemantic embedding space at a chosen LLM layer and optimizing the framework to explicitly reduce inter-method variability, our approach produces stable input-level importance scores and highlights salient features via a decompressed representation of the layer of interest. This advances the safe and trustworthy application of LLMs in cognitive health and neurodegenerative disease.
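To make the central quantity concrete, the sketch below illustrates one simple way to measure inter-method variability and to ensemble attribution scores. This is not the paper's implementation: the attribution matrix, the L1 normalization, and the per-token standard deviation are illustrative assumptions standing in for whatever attribution methods and variability objective the framework actually uses.

```python
import numpy as np

def inter_method_variability(scores: np.ndarray) -> float:
    """Mean per-token standard deviation across attribution methods.

    scores: array of shape (n_methods, n_tokens), one row per
    attribution method. Rows are L1-normalized first so that methods
    on different scales are comparable (an assumption of this sketch).
    """
    norm = scores / np.abs(scores).sum(axis=1, keepdims=True)
    return float(norm.std(axis=0).mean())

def ensemble_attribution(scores: np.ndarray) -> np.ndarray:
    """Average the L1-normalized scores into one importance vector."""
    norm = scores / np.abs(scores).sum(axis=1, keepdims=True)
    return norm.mean(axis=0)

# Toy example: three hypothetical attribution methods over five tokens,
# each a noisy perturbation of the same underlying importance profile.
rng = np.random.default_rng(0)
base = np.array([0.1, 0.5, 0.2, 0.05, 0.15])
scores = np.stack([base + 0.05 * rng.standard_normal(5) for _ in range(3)])

print("variability:", inter_method_variability(scores))
print("ensemble:   ", ensemble_attribution(scores))
```

In this simplified picture, the framework's layer-wise optimization would drive a statistic like `inter_method_variability` toward zero, so that the ensembled scores become stable regardless of which attribution method produced them.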