SigBERT: Combining Narrative Medical Reports and Rough Path Signature Theory for Survival Risk Estimation in Oncology

📅 2025-07-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing survival analysis methods struggle to model the temporal dynamics of narrative clinical text in electronic health records, which limits the accuracy of personalized risk prediction for oncology patients. To address this, we propose a framework that integrates temporal text modeling with survival analysis. Specifically, we introduce signature features, originally developed in rough path theory, into medical text analysis for the first time, enabling a geometric characterization of how sentence embeddings evolve over time. Sentence representations are extracted using BERT, and signature features computed over each patient's sequence of embeddings capture nonlinear temporal patterns of clinical progression; these features are fed into a LASSO-regularized Cox proportional hazards model for risk estimation. Evaluated on a real-world oncology cohort, our model achieves a concordance index (C-index) of 0.75 (SD = 0.014) on the independent test cohort, significantly outperforming baseline methods. This supports the effectiveness of signature features for temporally structured clinical text modeling.
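
For readers unfamiliar with the two ingredients named above, the standard textbook definitions can be written as follows. This is a sketch only: the notation (path X, covariates z_i, event indicators delta_i, penalty weight lambda) is assumed here for illustration and is not taken from the paper, and the summary does not state the truncation depth or any normalization the authors use.

```latex
% Signature of a path X : [0, T] -> R^d, component indexed by (i_1, ..., i_k).
% In practice the collection is truncated at some depth k <= K.
S(X)^{(i_1,\dots,i_k)}
  = \int_{0 < t_1 < \dots < t_k < T}
    \mathrm{d}X^{i_1}_{t_1} \cdots \mathrm{d}X^{i_k}_{t_k}

% LASSO-penalized Cox estimator, with z_i the feature vector of patient i
% (here, the signature features), T_i the observed time, and
% \delta_i = 1 if the event was observed, 0 if censored.
\hat{\beta}
  = \arg\min_{\beta}
    \Bigl\{
      -\sum_{i : \delta_i = 1}
        \Bigl[ \beta^{\top} z_i
          - \log \sum_{j : T_j \ge T_i} \exp\bigl(\beta^{\top} z_j\bigr)
        \Bigr]
      + \lambda \lVert \beta \rVert_1
    \Bigr\}
```

The patient-specific risk score is then the linear predictor \hat{\beta}^{\top} z_i; the L1 penalty drives many coefficients to exactly zero, which matters because the number of signature terms grows quickly with the embedding dimension.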

📝 Abstract

Electronic health records (EHRs) contain a vast amount of information that can be leveraged for machine learning applications in healthcare. However, existing survival analysis methods often struggle to handle the complexity of textual data, particularly in its sequential form. Here, we propose SigBERT, a temporal survival analysis framework designed to efficiently process a large number of clinical reports per patient. SigBERT processes timestamped medical reports by extracting word embeddings and averaging them into sentence embeddings. To capture temporal dynamics from the time series of sentence embedding coordinates, we apply signature extraction from rough path theory to derive geometric features for each patient, which significantly enhance survival model performance. These features are then integrated into a LASSO-penalized Cox model to estimate patient-specific risk scores. The model was trained and evaluated on a real-world oncology corpus from the Léon Bérard Center, achieving a C-index of 0.75 (SD 0.014) on the independent test cohort. SigBERT integrates sequential medical data to enhance risk estimation, advancing narrative-based survival analysis.
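
The feature-extraction steps described in the abstract (averaging word embeddings per report, then taking the signature of the resulting time series) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the signature is truncated at depth 2, the timestamp is added as an extra path channel by assumption, and the toy embedding dimension stands in for real BERT outputs (which would normally need dimensionality reduction first, since depth-2 terms grow quadratically with dimension).

```python
import numpy as np


def mean_pool(word_embeddings):
    """Average a (num_words, dim) array of word embeddings into one sentence embedding."""
    return np.asarray(word_embeddings, dtype=float).mean(axis=0)


def signature_depth2(path):
    """Depth-2 signature of the piecewise-linear path through the rows of `path`.

    `path` has shape (T, d). Returns (level1, level2) with shapes (d,) and (d, d).
    """
    path = np.asarray(path, dtype=float)
    d = path.shape[1]
    level1 = np.zeros(d)
    level2 = np.zeros((d, d))
    for delta in np.diff(path, axis=0):
        # Chen's identity for appending one linear segment:
        # S2 <- S2 + S1 (outer) delta + (delta (outer) delta) / 2
        level2 += np.outer(level1, delta) + 0.5 * np.outer(delta, delta)
        level1 += delta
    return level1, level2


def patient_features(reports):
    """Build one feature vector from a patient's reports.

    `reports` is a list of (timestamp, word_embeddings) pairs (hypothetical format).
    """
    reports = sorted(reports, key=lambda r: r[0])  # order reports chronologically
    times = np.array([t for t, _ in reports], dtype=float)
    sentences = np.stack([mean_pool(w) for _, w in reports])
    # Assumption: time is appended as an extra channel of the path.
    path = np.column_stack([times, sentences])
    level1, level2 = signature_depth2(path)
    return np.concatenate([level1, level2.ravel()])
```

For a path with d channels this yields d + d² features per patient regardless of how many reports the patient has, which is what lets a fixed-size Cox model consume variable-length report histories.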

Problem

Research questions and friction points this paper is trying to address.

Handling complexity of textual data in survival analysis

Processing sequential medical reports for temporal dynamics

Enhancing survival risk estimation with narrative medical data

Innovation

Methods, ideas, or system contributions that make the work stand out.

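Averages word embeddings into one sentence embedding per timestamped report

Applies signature extraction from rough path theory to capture temporal dynamics

Feeds the signature features into a LASSO-penalized Cox model to estimate patient-specific risk scores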
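
To make the last contribution concrete, the sketch below writes out the objective that a LASSO-penalized Cox model minimizes, together with Harrell's C-index, the metric reported for the model. It is an illustration under assumptions rather than the paper's code: function names are hypothetical, the partial likelihood assumes no tied event times, and no optimizer is included (in practice a dedicated survival-analysis library would fit the coefficients).

```python
import numpy as np


def lasso_cox_objective(beta, X, time, event, lam):
    """Negative log partial likelihood plus an L1 penalty (assumes no tied event times)."""
    beta = np.asarray(beta, dtype=float)
    X = np.asarray(X, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    eta = X @ beta  # linear predictor, i.e. the risk score of each patient
    nll = 0.0
    for i in np.flatnonzero(event):
        at_risk = time >= time[i]  # patients still under observation at time[i]
        nll -= eta[i] - np.log(np.sum(np.exp(eta[at_risk])))
    return nll + lam * np.sum(np.abs(beta))


def concordance_index(risk, time, event):
    """Harrell's C-index: share of comparable pairs that the risk score orders correctly."""
    risk = np.asarray(risk, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    concordant, comparable = 0.0, 0
    for i in np.flatnonzero(event):
        for j in range(len(time)):
            if time[i] < time[j]:  # patient i had the event before j's observed time
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # ties in risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A C-index of 0.5 corresponds to random ordering and 1.0 to perfect ordering, so the reported 0.75 means that in roughly three out of four comparable patient pairs the patient with the higher predicted risk experienced the event first.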

🔎 Similar Papers

Can-SAVE: Mass Cancer Risk Prediction via Survival Analysis Variables and EHR