🤖 AI Summary
Existing survival analysis methods struggle to model the temporal dynamics of narrative clinical text in electronic health records, which limits the accuracy of personalized risk prediction for oncology patients. To address this, we propose a framework that integrates temporal text modeling with survival analysis. Specifically, we introduce signature features, originally developed in rough path theory, into medical text analysis for the first time, giving a geometric characterization of how sentence embeddings evolve over time. Sentence embeddings are obtained by averaging BERT word embeddings for each timestamped report; signature features computed on the resulting time series capture nonlinear temporal patterns of clinical progression and are fed into a LASSO-regularized Cox proportional hazards model for risk estimation. Evaluated on a real-world oncology cohort, the model achieves a concordance index (C-index) of 0.75 (SD = 0.014), significantly outperforming baseline methods. This supports the effectiveness, and suggests the generalizability, of signature features for temporally structured clinical text modeling.
📝 Abstract
Electronic health records (EHRs) contain a vast amount of information that can be leveraged for machine learning applications in healthcare. However, existing survival analysis methods often struggle to handle the complexity of textual data, particularly in its sequential form. Here, we propose SigBERT, a temporal survival analysis framework designed to efficiently process a large number of clinical reports per patient. SigBERT processes timestamped medical reports by extracting word embeddings and averaging them into sentence embeddings. To capture temporal dynamics from the time series of sentence embedding coordinates, we apply signature extraction from rough path theory to derive geometric features for each patient, which significantly enhance survival model performance. These features are then integrated into a LASSO-penalized Cox model to estimate patient-specific risk scores. The model was trained and evaluated on a real-world oncology dataset from the Léon Bérard Center corpus, achieving a C-index of 0.75 (SD 0.014) on the independent test cohort. SigBERT integrates sequential medical data to enhance risk estimation, advancing narrative-based survival analysis.
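The two feature-building steps described in the abstract (averaging word embeddings into one sentence embedding per report, then extracting a signature from the resulting time series) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `mean_pool` and `signature_depth2`, the truncation at depth 2, the treatment of the embeddings as a piecewise-linear path, and the toy 2-dimensional vectors are all assumptions made for illustration. The BERT encoder and the LASSO-penalized Cox fit are not shown.

```python
def mean_pool(word_vectors):
    """Average a report's word embeddings into one sentence embedding."""
    n = len(word_vectors)
    dim = len(word_vectors[0])
    return [sum(vec[i] for vec in word_vectors) / n for i in range(dim)]


def signature_depth2(path):
    """Depth-2 signature of the piecewise-linear path through `path`.

    `path` is a list of equal-length vectors ordered by report time
    (illustrative assumption: one sentence embedding per report).
    Returns (level1, level2), where level1[i] is the total increment of
    coordinate i and level2[i][j] is the iterated integral of coordinate i
    against coordinate j.
    """
    dim = len(path[0])
    level1 = [0.0] * dim
    level2 = [[0.0] * dim for _ in range(dim)]
    for prev, curr in zip(path, path[1:]):
        delta = [c - p for p, c in zip(prev, curr)]
        # Chen's identity for appending one linear segment: the cross term
        # uses level 1 accumulated *before* this segment.
        for i in range(dim):
            for j in range(dim):
                level2[i][j] += level1[i] * delta[j] + 0.5 * delta[i] * delta[j]
        for i in range(dim):
            level1[i] += delta[i]
    return level1, level2


# Toy patient with three reports, each holding two 2-dimensional word vectors.
reports = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[2.0, 0.0], [0.0, 0.0]],
    [[1.0, 2.0], [1.0, 0.0]],
]
patient_path = [mean_pool(report) for report in reports]
level1, level2 = signature_depth2(patient_path)
# Flattened feature vector that a downstream survival model could consume.
features = level1 + [value for row in level2 for value in row]
```

The antisymmetric part of the second level, `0.5 * (level2[i][j] - level2[j][i])`, is the signed area swept between coordinates i and j, which is how the signature encodes the order in which changes occur rather than only their net size. In practice a dedicated signature library and a higher truncation depth would likely be used, and the flattened features would be passed to a penalized Cox model.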