A Monosemantic Attribution Framework for Stable Interpretability in Clinical Neuroscience Large Language Models

📅 2026-01-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the instability and inconsistency of attribution-based explanations in large language models (LLMs) within clinical neuroscience, which often arise from representational polysemy. To mitigate this issue, the authors propose a unified framework that integrates attribution with mechanistic interpretability by constructing and optimizing a monosemantic embedding space at specific LLM layers. This approach explicitly disentangles semantic features, thereby reducing variability across different attribution methods and yielding stable importance scores aligned with input–output mappings. By combining monosemantic representation learning, attribution ensembling, and layer-wise feature optimization, the method significantly enhances the reliability and reproducibility of explanations in tasks such as Alzheimer’s disease progression diagnosis, offering a trustworthy foundation for the safe deployment of LLMs in cognitive health applications.
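The summary above describes building and optimizing a monosemantic embedding space at a chosen LLM layer. As a rough illustration only, the sketch below shows one common way such a layer-wise "decompression" could be set up, using a sparse-autoencoder-style encoder/decoder over hidden states; the class name, expansion factor, and loss weights are illustrative assumptions and are not taken from the paper.

```python
# Hypothetical sketch: a sparse-autoencoder-style "decompression" of one LLM layer,
# encouraging monosemantic (one-feature-per-concept) directions.
# Names and hyperparameters are illustrative, not the authors' released code.
import torch
import torch.nn as nn


class MonosemanticLayerDecoder(nn.Module):
    """Over-complete sparse encoder/decoder for a single layer's hidden states."""

    def __init__(self, d_model: int, expansion: int = 8, l1_coeff: float = 1e-3):
        super().__init__()
        d_dict = d_model * expansion            # over-complete dictionary size
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model, bias=False)
        self.l1_coeff = l1_coeff

    def forward(self, h: torch.Tensor):
        # h: (batch, seq, d_model) hidden states from the layer of interest
        z = torch.relu(self.encoder(h))         # sparse, non-negative feature codes
        h_hat = self.decoder(z)                 # reconstruction of the original layer
        recon = (h_hat - h).pow(2).mean()       # fidelity to the original representation
        sparsity = z.abs().mean()               # L1 pressure toward monosemantic features
        loss = recon + self.l1_coeff * sparsity
        return z, loss


if __name__ == "__main__":
    # Toy usage on random activations standing in for a real LLM layer.
    torch.manual_seed(0)
    h = torch.randn(4, 128, 768)                # e.g. batch=4, seq=128, d_model=768
    sae = MonosemanticLayerDecoder(d_model=768)
    z, loss = sae(h)
    print(z.shape, float(loss))                 # (4, 128, 6144) and a scalar loss
```

The sparse codes z stand in for the "decompressed representation of the layer of interest" mentioned in the abstract; how the authors actually construct and optimize their monosemantic space is detailed in the paper itself.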

📝 Abstract
Interpretability remains a key challenge for deploying large language models (LLMs) in clinical settings such as Alzheimer's disease progression diagnosis, where early and trustworthy predictions are essential. Existing attribution methods exhibit high inter-method variability and unstable explanations due to the polysemantic nature of LLM representations, while mechanistic interpretability approaches lack direct alignment with model inputs and outputs and do not provide explicit importance scores. We introduce a unified interpretability framework that integrates attributional and mechanistic perspectives through monosemantic feature extraction. By constructing a monosemantic embedding space at the level of an LLM layer and optimizing the framework to explicitly reduce inter-method variability, our approach produces stable input-level importance scores and highlights salient features via a decompressed representation of the layer of interest, advancing the safe and trustworthy application of LLMs in cognitive health and neurodegenerative disease.
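The abstract's optimization target is reduced inter-method variability of attribution scores. As a minimal, hedged illustration of that idea, the snippet below sketches how an ensemble attribution and a simple variability score could be computed from several methods' outputs; the inputs are placeholders and this is not the paper's actual attribution pipeline or objective.

```python
# Hypothetical sketch: pooling several attribution maps into one ensemble score and
# quantifying inter-method variability, as a stand-in for the stability objective
# described in the abstract. Attribution outputs here are synthetic placeholders.
import numpy as np


def ensemble_attributions(attribution_maps: list[np.ndarray]) -> np.ndarray:
    """Z-normalize each method's token-level scores, then average across methods."""
    normed = [(a - a.mean()) / (a.std() + 1e-8) for a in attribution_maps]
    return np.mean(normed, axis=0)


def inter_method_variability(attribution_maps: list[np.ndarray]) -> float:
    """Mean per-token standard deviation across methods (lower = more stable)."""
    normed = np.stack([(a - a.mean()) / (a.std() + 1e-8) for a in attribution_maps])
    return float(normed.std(axis=0).mean())


if __name__ == "__main__":
    # Toy example: three hypothetical attribution methods over a 10-token input.
    rng = np.random.default_rng(0)
    base = rng.normal(size=10)
    maps = [base + 0.1 * rng.normal(size=10) for _ in range(3)]
    print("ensemble:", ensemble_attributions(maps).round(2))
    print("variability:", round(inter_method_variability(maps), 3))
```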
Problem

Research questions and friction points this paper is trying to address.

interpretability
large language models
attribution
clinical neuroscience
Alzheimer's disease
Innovation

Methods, ideas, or system contributions that make the work stand out.

monosemantic
attribution
interpretability
large language models
clinical neuroscience
Authors
Michail Mamalakis
Department of Computer Science and Technology, Cancer Research UK Cambridge Institute, University of Cambridge, United Kingdom
Tiago Azevedo
University of Cambridge
Cristian Cosentino
PhD in ICT - DIMES, University of Calabria
Large Language Models · Big Data Analysis · Social Media Analysis · Natural Language Processing
Chiara D'Ercoli
Department of Computer, Automatic and Management Engineering (DIAG), Sapienza Università di Roma, Italy
Subati Abulikemu
Department of Psychiatry, University of Cambridge, United Kingdom
Zhongtian Sun
University of Cambridge
Artificial Intelligence · Representation Learning · Geometric Machine Learning · Neuroscience
Richard A. I. Bethlehem
Department of Psychology, University of Cambridge, United Kingdom
Pietro Liò
Professor, University of Cambridge
AI & Computational Biology -> Medicine