Surrogate modeling for interpreting black-box LLMs in medical predictions

📅 2026-04-22

🤖 AI Summary
This study addresses the limited interpretability of large language models (LLMs) in medical prediction tasks, which obscures the medical knowledge encoded within them and potential biases they may harbor. To tackle this issue, the authors propose a prompt-engineering-based surrogate modeling framework that approximates the latent knowledge space of an LLM by generating large-scale input–output pairs, enabling quantitative analysis of the model's dependence on individual input variables. This approach represents the first application of surrogate modeling to systematically deconstruct the medical knowledge structure embedded in LLMs. The framework successfully identifies erroneous associations contradicting current medical consensus and exposes race-based biases that have been scientifically discredited, thereby offering an interpretable early-warning mechanism to support the safe deployment of LLMs in clinical settings.

๐Ÿ“ Abstract
Large language models (LLMs), trained on vast datasets, encode extensive real-world knowledge within their parameters, yet their black-box nature obscures the mechanisms and extent of this encoding. Surrogate modeling, which uses simplified models to approximate complex systems, can offer a path toward better interpretability of black-box models. We propose a surrogate modeling framework that quantitatively explains LLM-encoded knowledge. For a specific hypothesis derived from domain knowledge, this framework approximates the latent LLM knowledge space using observable elements (input-output pairs) through extensive prompting across a comprehensive range of simulated scenarios. Through proof-of-concept experiments in medical predictions, we demonstrate our framework's effectiveness in revealing the extent to which LLMs "perceive" each input variable in relation to the output. Particularly, given concerns that LLMs may perpetuate inaccuracies and societal biases embedded in their training data, our experiments using this framework quantitatively revealed both associations that contradict established medical knowledge and the persistence of scientifically refuted racial assumptions within LLM-encoded knowledge. By disclosing these issues, our framework can act as a red-flag indicator to support the safe and reliable application of these models.
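The workflow the abstract describes (prompt over a grid of simulated scenarios, collect input-output pairs, fit a simple model, read off per-variable dependence) can be sketched roughly as follows. This is not the authors' implementation. The variable names, the prompt wording, and `query_black_box` (a fake stand-in that replaces a real LLM call so the sketch runs offline) are all assumptions made for illustration. The surrogate here is a plain main-effects estimate; the summary does not say which surrogate model the paper uses.

```python
import itertools
import random
from statistics import mean

# Hypothetical binary input variables for a simulated patient (not from the paper).
VARIABLES = {
    "age_over_65": [0, 1],
    "smoker": [0, 1],
    "race_group_b": [0, 1],
}

def build_prompt(scenario):
    """Render one simulated scenario as a prompt string (illustrative wording)."""
    facts = ", ".join(f"{name}={value}" for name, value in scenario.items())
    return f"Patient with {facts}. Estimate the probability of the outcome (0-1)."

def query_black_box(prompt, rng):
    """Stand-in for an LLM call (assumption). A real study would send `prompt`
    to the model and parse its numeric answer; here a response is faked."""
    score = 0.10
    score += 0.30 * ("age_over_65=1" in prompt)
    score += 0.20 * ("smoker=1" in prompt)
    score += 0.05 * ("race_group_b=1" in prompt)  # planted spurious dependence
    return score + rng.gauss(0, 0.01)  # small noise mimics sampling variability

def collect_pairs(n_repeats=20, seed=0):
    """Query every combination of variable values n_repeats times."""
    rng = random.Random(seed)
    names = list(VARIABLES)
    pairs = []
    for values in itertools.product(*VARIABLES.values()):
        scenario = dict(zip(names, values))
        prompt = build_prompt(scenario)
        for _ in range(n_repeats):
            pairs.append((scenario, query_black_box(prompt, rng)))
    return pairs

def main_effects(pairs):
    """Mean output with a variable set to 1 minus mean output with it set to 0."""
    effects = {}
    for name in VARIABLES:
        on = [y for s, y in pairs if s[name] == 1]
        off = [y for s, y in pairs if s[name] == 0]
        effects[name] = mean(on) - mean(off)
    return effects

pairs = collect_pairs()
effects = main_effects(pairs)
for name, effect in sorted(effects.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:>14}: {effect:+.3f}")
```

Because the scenario grid is a balanced full factorial, each difference in means coincides with the corresponding coefficient of a main-effects linear regression, so the printed values act as a simple linear surrogate. A clearly nonzero effect for a variable that should be clinically irrelevant, such as the planted `race_group_b` term, is the kind of red flag the framework is meant to surface.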
Problem

Research questions and friction points this paper is trying to address.

black-box LLMs
interpretability
surrogate modeling
medical predictions
bias detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

surrogate modeling
large language models
interpretability
medical prediction
bias detection
Authors
Changho Han
Assistant Professor, Department of Mathematics, Korea University
Mathematics: Algebraic Geometry, Arithmetic Geometry
Songsoo Kim
Yonsei University College of Medicine
Radiology, Medical AI
Dong Won Kim
Department of Biomedical Systems Informatics, Yonsei University College of Medicine, Seoul, Republic of Korea
Leo Anthony Celi
Massachusetts Institute of Technology
Jaewoong Kim
Department of Biomedical Systems Informatics, Yonsei University College of Medicine, Seoul, Republic of Korea
SungA Bae
Department of Cardiology, Yongin Severance Hospital, Yonsei University College of Medicine, Yongin, Republic of Korea; Center for Digital Health, Yongin Severance Hospital, Yonsei University Health System, Yongin, Republic of Korea
Dukyong Yoon
Department of Biomedical Systems Informatics, Yonsei University College of Medicine
Medical informatics, Bio-signal data, Artificial intelligence in medicine