TIMER: Temporal Instruction Modeling and Evaluation for Longitudinal Clinical Records

📅 2025-03-06
🤖 AI Summary
This work addresses the limited ability of large language models (LLMs) to model temporal dependencies in longitudinal electronic health records (EHRs). We propose TIMER-Instruct, a time-aware instruction-tuning method, and introduce TIMER-Bench, the first time-aware benchmark for temporal reasoning over longitudinal EHRs. The method uses instruction-response pairs grounded in different parts of a patient's record, so that both tuning and evaluation cover reasoning across multiple visits and time frames. Our contributions are threefold: (1) TIMER, a framework that treats temporal grounding as a dimension of both instruction tuning and evaluation; (2) TIMER-Bench, a benchmark for temporal reasoning over longitudinal EHRs; and (3) performance gains of 7.3% on human-generated benchmarks and 9.2% on TIMER-Bench for models fine-tuned with TIMER-Instruct.

📝 Abstract
Large language models (LLMs) have emerged as promising tools for assisting in medical tasks, yet processing Electronic Health Records (EHRs) presents unique challenges due to their longitudinal nature. While LLMs' capabilities to perform medical tasks continue to improve, their ability to reason over temporal dependencies across multiple patient visits and time frames remains unexplored. We introduce TIMER (Temporal Instruction Modeling and Evaluation for Longitudinal Clinical Records), a framework that incorporates instruction-response pairs grounded in different parts of a patient's record as a critical dimension in both instruction evaluation and tuning for longitudinal clinical records. We develop TIMER-Bench, the first time-aware benchmark that evaluates temporal reasoning capabilities over longitudinal EHRs, as well as TIMER-Instruct, an instruction-tuning methodology for LLMs to learn reasoning over time. We demonstrate that models fine-tuned with TIMER-Instruct improve performance by 7.3% on human-generated benchmarks and 9.2% on TIMER-Bench, indicating that temporal instruction-tuning improves model performance for reasoning over EHRs.
Problem

Research questions and friction points this paper is trying to address.

Challenges in processing longitudinal Electronic Health Records (EHRs).
Little prior exploration of LLMs' temporal reasoning across patient visits.
Improving LLMs' performance in temporal reasoning over EHRs.
Innovation

Methods, ideas, or system contributions that make the work stand out.

TIMER framework for temporal EHR analysis
TIMER-Bench: time-aware benchmark for EHRs
TIMER-Instruct: instruction-tuning for temporal reasoning
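
To make the idea of instruction-response pairs grounded in different parts of a patient's record more concrete, here is a minimal Python sketch. It is not the authors' implementation. The names `Visit`, `TemporalInstruction`, `relative_position`, and `evidence_bins`, the three-bin split, and the sample notes are all assumptions made for illustration. The sketch assumes visits are sorted by date and shows one way to check which slices of a timeline an instruction draws its evidence from.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Visit:
    """One encounter in a hypothetical longitudinal record."""
    when: date
    note: str


@dataclass
class TemporalInstruction:
    """Hypothetical instruction-response pair tied to the visits that support it."""
    instruction: str
    response: str
    evidence_visits: list[int]  # indices into the patient's visit list


def relative_position(visits: list[Visit], index: int) -> float:
    """Return where a visit falls on the timeline: 0.0 is the first visit, 1.0 the last."""
    start, end = visits[0].when, visits[-1].when
    span = (end - start).days
    if span == 0:
        return 0.0
    return (visits[index].when - start).days / span


def evidence_bins(visits: list[Visit], pair: TemporalInstruction, n_bins: int = 3) -> list[int]:
    """Count how many evidence visits fall in each equal-width slice of the timeline."""
    counts = [0] * n_bins
    for i in pair.evidence_visits:
        pos = relative_position(visits, i)
        # Clamp so that the final visit (position 1.0) lands in the last bin.
        counts[min(int(pos * n_bins), n_bins - 1)] += 1
    return counts


# Invented sample data, assumed to be sorted by date.
visits = [
    Visit(date(2020, 1, 1), "started medication A"),
    Visit(date(2021, 1, 1), "routine follow-up"),
    Visit(date(2022, 1, 1), "medication A discontinued"),
]
pair = TemporalInstruction(
    instruction="How did the patient's medication change over time?",
    response="Medication A was started in 2020 and discontinued in 2022.",
    evidence_visits=[0, 2],
)
print(evidence_bins(visits, pair))  # [1, 0, 1]
```

An instruction whose evidence falls only in the last bin exercises recent context alone, whereas one with counts spread across bins requires reasoning over the whole record.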