Agentic clinical reasoning over longitudinal myeloma records: a retrospective evaluation against expert consensus

📅 2026-04-27

🤖 AI Summary
This work asks whether LLM-based systems can synthesise the heterogeneous longitudinal records of multiple myeloma patients, which span years to decades, at a level approaching expert agreement. An agentic reasoning system is compared against single-pass retrieval-augmented generation (RAG), iterative RAG, and full-context input on 469 patient-question pairs with oncologist-annotated reference labels. The agentic system reaches 79.6% concordance with expert consensus, exceeding iterative RAG and full-context input by 3.8 to 4.2 percentage points, and it is the only approach to exceed the ceiling those two baselines share. Gains grow with difficulty: +9.4 percentage points on criteria-based synthesis questions and +13.5 percentage points on the longest records (top decile). The system error rate (12.2%) is comparable to expert disagreement (13.6%), but 57.8% of system errors were clinically significant versus 18.8% of expert disagreements, so prospective evaluation in routine care is required before the results translate into patient benefit.

📝 Abstract
Multiple myeloma is managed through sequential lines of therapy over years to decades, with each decision depending on cumulative disease history distributed across dozens to hundreds of heterogeneous clinical documents. Whether LLM-based systems can synthesise this evidence at a level approaching expert agreement has not been established. A retrospective evaluation was conducted on longitudinal clinical records of 811 myeloma patients treated at a tertiary centre (2001-2026), covering 44,962 documents and 1,334,677 laboratory values, with external validation on MIMIC-IV. An agentic reasoning system was compared against single-pass retrieval-augmented generation (RAG), iterative RAG, and full-context input on 469 patient-question pairs from 48 templates at three complexity levels. Reference labels came from double annotation by four oncologists with senior haematologist adjudication. Iterative RAG and full-context input converged on a shared ceiling (75.4% vs 75.8%, p = 1.00). The agentic system reached 79.6% concordance (95% CI 76.4-82.8), exceeding both baselines (+3.8 and +4.2 pp; p = 0.006 and 0.007). Gains rose with question complexity, reaching +9.4 pp on criteria-based synthesis (p = 0.032), and with record length, reaching +13.5 pp in the top decile (n = 10). The system error rate (12.2%) was comparable to expert disagreement (13.6%), but severity was inverted: 57.8% of system errors were clinically significant versus 18.8% of expert disagreements. Agentic reasoning was the only approach to exceed the shared ceiling, with gains concentrated on the most complex questions and longest records. The greater clinical consequence of residual system errors indicates that prospective evaluation in routine care is required before these findings translate into patient benefit.
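
The headline figures in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: the abstract does not say which interval method or paired test was used, so the normal-approximation (Wald) interval and the exact McNemar test here are assumptions, and the discordant counts passed to `mcnemar_exact` are hypothetical values, not data from the study.

```python
import math


def pp_gain(system_pct, baseline_pct):
    """Difference between two percentages, in percentage points."""
    return round(system_pct - baseline_pct, 1)


def wald_ci(p_hat, n, z=1.96):
    """Normal-approximation 95% CI for a proportion (assumed method)."""
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: pairs where only method A was correct
    c: pairs where only method B was correct
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) lower tail up to k, doubled and capped at 1.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Gains over the two baselines that share a ceiling (figures from the abstract).
gain_full_context = pp_gain(79.6, 75.8)
gain_iterative_rag = pp_gain(79.6, 75.4)

# Wald interval for 79.6% concordance on 469 patient-question pairs.
ci_low, ci_high = wald_ci(0.796, 469)

# Hypothetical discordant counts: balanced disagreements give p = 1.0.
p_balanced = mcnemar_exact(5, 5)
```

The Wald interval comes out slightly wider than the reported 76.4-82.8, which suggests the authors used a different interval method; the sketch is only meant to show the order of magnitude of the uncertainty on 469 pairs.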
Problem

Research questions and friction points this paper is trying to address.

multiple myeloma
clinical reasoning
longitudinal records
expert consensus
LLM-based systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

agentic reasoning
longitudinal clinical records
retrieval-augmented generation
clinical decision support
multiple myeloma
Johannes Moll
Technical University of Munich, Stanford University
Jannik Lübberstedt
Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, School of Medicine and Health, Technical University of Munich, Munich, Germany
Christoph Nuernbergk
Department of Medicine III, Klinikum rechts der Isar, TUM University Hospital, School of Medicine and Health, Technical University of Munich, Munich, Germany
Jacob Stroh
Department of Medicine III, Klinikum rechts der Isar, TUM University Hospital, School of Medicine and Health, Technical University of Munich, Munich, Germany
Luisa Mertens
Department of Medicine III, Klinikum rechts der Isar, TUM University Hospital, School of Medicine and Health, Technical University of Munich, Munich, Germany
Anna Purcarea
Department of Medicine III, Klinikum rechts der Isar, TUM University Hospital, School of Medicine and Health, Technical University of Munich, Munich, Germany
Christopher Zirn
Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, School of Medicine and Health, Technical University of Munich, Munich, Germany
Zeineb Benchaaben
Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, School of Medicine and Health, Technical University of Munich, Munich, Germany
Fabian Drexel
Chair for AI in Healthcare and Medicine, Technical University of Munich (TUM) and TUM University Hospital, Munich, Germany
Hartmut Häntze
Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, School of Medicine and Health, Technical University of Munich, Munich, Germany
Anirudh Narayanan
Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, School of Medicine and Health, Technical University of Munich, Munich, Germany
Friedrich Puttkammer
Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, School of Medicine and Health, Technical University of Munich, Munich, Germany
Andrei Zhukov
Department of Gastroenterology, Infectious Diseases and Rheumatology, Charité – Universitätsmedizin Berlin, Berlin, Germany
Jacqueline Lammert
Chair of Medical Informatics, Institute of AI in Medicine and Healthcare, TUM School of Medicine and Health, Technical University of Munich, Munich, Germany
Sebastian Ziegelmayer
Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, School of Medicine and Health, Technical University of Munich, Munich, Germany
Markus Graf
Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, School of Medicine and Health, Technical University of Munich, Munich, Germany
Marion Högner
Department of Medicine III, Klinikum rechts der Isar, TUM University Hospital, School of Medicine and Health, Technical University of Munich, Munich, Germany
Marcus Makowski
Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, School of Medicine and Health, Technical University of Munich, Munich, Germany
Florian Bassermann
Department of Medicine III, Klinikum rechts der Isar, TUM University Hospital, School of Medicine and Health, Technical University of Munich, Munich, Germany
Lisa C. Adams
Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, School of Medicine and Health, Technical University of Munich, Munich, Germany
Jiazhen Pan
Technical University of Munich
Machine Learning, Medical Image Computing, Biomedical Image Analysis
Daniel Rueckert
Technical University of Munich and Imperial College London
Machine Learning, Medical Image Computing, Biomedical Image Analysis, Computer Vision
Krischan Braitsch
Department of Medicine III, Klinikum rechts der Isar, TUM University Hospital, School of Medicine and Health, Technical University of Munich, Munich, Germany
Keno K. Bressem
Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, School of Medicine and Health, Technical University of Munich, Munich, Germany