🤖 AI Summary
Existing early event prediction (EEP) methods for clinical time series often produce unstable risk scores and temporally inconsistent risk trajectories, which undermines clinical trust. To address this, the authors propose CAREBench, a multimodal EEP benchmark designed around clinical reliability that combines tabular electronic health records, electrocardiogram waveforms, and clinical notes. It introduces a stability metric based on local Lipschitz constants and evaluates temporal stability alongside predictive accuracy. Experiments across six prediction tasks show that existing methods, and large language models in particular, struggle to achieve accuracy and stability together, with notably poor recall at high-precision operating points. The findings argue for risk prediction models whose trajectories are stable and shift only in response to new, relevant evidence.
📝 Abstract
Early event prediction (EEP) systems continuously estimate a patient's imminent risk to support clinical decision-making. For bedside trust, risk trajectories must be accurate and temporally stable, shifting only with new, relevant evidence. However, current benchmarks (a) ignore the stability of risk scores and (b) evaluate mainly on tabular inputs, leaving trajectory behavior untested. To address this gap, we introduce CAREBench, an EEP benchmark that evaluates deployability using multimodal inputs (tabular EHR, ECG waveforms, and clinical text) and assesses temporal stability alongside predictive accuracy. We propose a stability metric that quantifies short-term variability in per-patient risk and penalizes abrupt oscillations based on local Lipschitz constants. CAREBench spans six prediction tasks, such as sepsis onset, and compares classical learners, deep sequence models, and zero-shot LLMs. Across tasks, existing methods, especially LLMs, struggle to jointly optimize accuracy and stability, with notably poor recall at high-precision operating points. These results highlight the need for models that produce evidence-aligned, stable trajectories to earn clinician trust in continuous monitoring settings. (Code: https://github.com/SeewonChoi/CAREBench)
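
The abstract does not give the exact form of the stability metric, so the following is only a minimal sketch of one plausible reading: for each time step in a patient's risk trajectory, estimate a local Lipschitz constant as the steepest change in risk per unit time within a short window, then average those estimates. The function names `local_lipschitz` and `instability`, the `window` parameter, and the choice of the mean as the aggregate are all assumptions made for illustration and are not taken from CAREBench.

```python
from typing import List, Sequence


def local_lipschitz(
    times: Sequence[float], risks: Sequence[float], window: float
) -> List[float]:
    """Per-timestep local Lipschitz estimate of a risk trajectory.

    For each time t_i, return the largest slope |r_j - r_i| / |t_j - t_i|
    over other points j with 0 < |t_j - t_i| <= window (0.0 if none).
    Hypothetical definition; the benchmark's own metric may differ.
    """
    if len(times) != len(risks):
        raise ValueError("times and risks must have the same length")
    estimates = []
    for i, (t_i, r_i) in enumerate(zip(times, risks)):
        worst = 0.0
        for j, (t_j, r_j) in enumerate(zip(times, risks)):
            dt = abs(t_j - t_i)
            if i != j and 0 < dt <= window:
                worst = max(worst, abs(r_j - r_i) / dt)
        estimates.append(worst)
    return estimates


def instability(
    times: Sequence[float], risks: Sequence[float], window: float
) -> float:
    """Mean local Lipschitz estimate; lower means a smoother trajectory.

    Averaging is an assumed aggregation chosen for this sketch.
    """
    estimates = local_lipschitz(times, risks, window)
    return sum(estimates) / len(estimates) if estimates else 0.0


# Illustrative, made-up trajectories (hours since admission, predicted risk).
hours = [0, 1, 2, 3, 4, 5]
smooth = [0.10, 0.12, 0.15, 0.19, 0.24, 0.30]  # rises gradually
jumpy = [0.10, 0.60, 0.15, 0.70, 0.20, 0.30]   # oscillates abruptly

print("smooth:", instability(hours, smooth, window=1.0))
print("jumpy: ", instability(hours, jumpy, window=1.0))
```

Under this reading, the oscillating trajectory scores as less stable than the gradually rising one even though both end at the same risk. That is the behavior the abstract describes as penalizing abrupt oscillations.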