A Practical Guide Towards Interpreting Time-Series Deep Clinical Predictive Models: A Reproducibility Study

📅 2026-03-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Clinical deep learning models face significant challenges in high-stakes settings due to the lack of reliable and generalizable interpretability, hindering their validation and deployment. This work presents the first systematic evaluation of multiple interpretability methods—including attention mechanisms, KernelSHAP, and LIME—across diverse deep temporal architectures and multitask clinical prediction scenarios. The study demonstrates that, when appropriately applied, attention mechanisms offer both computational efficiency and faithfulness to the underlying model, whereas KernelSHAP and LIME suffer from intractable computational demands or insufficient reliability in temporal tasks. Built upon the PyHealth framework, the project establishes the first reproducible and extensible benchmark for clinical interpretability and provides practical guidelines to inform future research, with all code publicly released to foster community advancement.
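The efficiency gap described above can be illustrated with a toy sketch (all names here are illustrative, not from the paper or PyHealth): attention-based attributions fall out of a single forward pass, whereas exact Shapley values require a number of model evaluations exponential in the number of time steps, which is why KernelSHAP must resort to sampling and remains costly for long clinical time series.

```python
import math
import itertools

# Count model evaluations to compare the cost of the two attribution styles.
CALLS = {"n": 0}

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def toy_model(x, scores):
    """One forward pass: attention over time steps, then a weighted sum.
    Masked steps (x[j] is None) get ~zero attention and contribute nothing."""
    CALLS["n"] += 1
    w = softmax([s if xi is not None else -1e9 for s, xi in zip(scores, x)])
    return sum(wi * (xi or 0.0) for wi, xi in zip(w, x)), w

def attention_attribution(x, scores):
    # Faithful by construction in this toy: the attributions *are* the
    # attention weights, obtained from a single forward pass.
    _, w = toy_model(x, scores)
    return w

def exact_shap(x, scores):
    """Exact Shapley values: iterates over all coalitions of time steps,
    so the number of model calls grows as T * 2^T."""
    T = len(x)
    idx = range(T)
    phi = [0.0] * T
    for i in idx:
        for r in range(T):
            for S in itertools.combinations([j for j in idx if j != i], r):
                keep = set(S)
                mask = lambda k: [x[j] if j in k else None for j in idx]
                weight = (math.factorial(r) * math.factorial(T - r - 1)
                          / math.factorial(T))
                with_i, _ = toy_model(mask(keep | {i}), scores)
                without_i, _ = toy_model(mask(keep), scores)
                phi[i] += weight * (with_i - without_i)
    return phi
```

For T time steps, `exact_shap` makes T·2^T model calls (e.g. 64 for T=4, over a million for T=16), while the attention attribution costs exactly one; KernelSHAP approximates the former with sampling, but still needs many perturbed evaluations per instance, matching the paper's finding that such black-box interpreters become infeasible on temporal clinical data.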

📝 Abstract
Clinical decisions are high-stakes and require explicit justification, making model interpretability essential for auditing deep clinical models prior to deployment. As the ecosystem of model architectures and explainability methods expands, critical questions remain: Do architectural features like attention improve explainability? Do interpretability approaches generalize across clinical tasks? While prior benchmarking efforts exist, they often lack extensibility and reproducibility, and critically, fail to systematically examine how interpretability varies across the interplay of clinical tasks and model architectures. To address these gaps, we present a comprehensive benchmark evaluating interpretability methods across diverse clinical prediction tasks and model architectures. Our analysis reveals that: (1) attention, when leveraged properly, is a highly efficient approach for faithfully interpreting model predictions; (2) black-box interpreters like KernelSHAP and LIME are computationally infeasible for time-series clinical prediction tasks; and (3) several interpretability approaches are too unreliable to be trustworthy. From our findings, we discuss several guidelines on improving interpretability within clinical predictive pipelines. To support reproducibility and extensibility, we provide our implementations via PyHealth, a well-documented open-source framework: https://github.com/sunlabuiuc/PyHealth.
Problem

Research questions and friction points this paper is trying to address.

interpretability
time-series
clinical prediction
model architecture
reproducibility
Innovation

Methods, ideas, or system contributions that make the work stand out.

interpretability
time-series clinical prediction
attention mechanism
reproducibility
model auditing
Yongda Fan
University of Illinois Urbana-Champaign, Champaign, IL 61801, USA; PyHealth
John Wu
University of Illinois Urbana-Champaign, Champaign, IL 61801, USA; PyHealth
Andrea Fitzpatrick
University of Illinois Urbana-Champaign, Champaign, IL 61801, USA; PyHealth
Naveen Baskaran
University of Illinois Urbana-Champaign, Champaign, IL 61801, USA; PyHealth
Jimeng Sun
Professor at University of Illinois Urbana-Champaign
AI for healthcare; Machine learning for healthcare; Deep learning for healthcare
Adam Cross
University of Illinois College of Medicine, Chicago, IL 60612, USA