Term2Note: Synthesising Differentially Private Clinical Notes from Medical Terms

📅 2025-09-13
🤖 AI Summary
In healthcare, real-world clinical texts are rarely accessible for model training due to stringent privacy regulations. To address this, we propose a differential privacy (DP)-enabled synthetic data generation framework tailored for long-form clinical records. Our method decouples content (medical terminology) from structure (such as section-level organization) and applies independent DP constraints to each component. We further introduce a staged generation strategy coupled with a DP-aware quality selector to jointly optimize utility and privacy while guaranteeing strict ε-differential privacy. Experimental results show that the synthetic clinical notes closely approximate real data in both statistical distributions and semantic fidelity. Multi-label classification models trained on the synthetic notes achieve 96.2% of the performance attained with real clinical notes, substantially outperforming existing DP-compliant text generation approaches.

📝 Abstract
Training data is fundamental to the success of modern machine learning models, yet in high-stakes domains such as healthcare, the use of real-world training data is severely constrained by concerns over privacy leakage. A promising solution to this challenge is the use of differentially private (DP) synthetic data, which offers formal privacy guarantees while maintaining data utility. However, striking the right balance between privacy protection and utility remains challenging in clinical note synthesis, given its domain specificity and the complexity of long-form text generation. In this paper, we present Term2Note, a methodology to synthesise long clinical notes under strong DP constraints. By structurally separating content and form, Term2Note generates section-wise note content conditioned on DP medical terms, with each governed by separate DP constraints. A DP quality maximiser further enhances synthetic notes by selecting high-quality outputs. Experimental results show that Term2Note produces synthetic notes with statistical properties closely aligned with real clinical notes, demonstrating strong fidelity. In addition, multi-label classification models trained on these synthetic notes perform comparably to those trained on real data, confirming their high utility. Compared to existing DP text generation baselines, Term2Note achieves substantial improvements in both fidelity and utility while operating under fewer assumptions, suggesting its potential as a viable privacy-preserving alternative to using sensitive clinical notes.
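The abstract does not say how the DP quality maximiser chooses among candidate outputs. Purely as an illustration of differentially private selection in general, the sketch below uses the exponential mechanism, a standard DP primitive that samples a candidate with probability proportional to exp(ε · score / (2 · sensitivity)). This mechanism is an assumption swapped in for illustration and is not claimed to be Term2Note's method; the function name, the example scores, and the sensitivity bound are all hypothetical.

```python
import math
import random


def exponential_mechanism_select(scores, epsilon, sensitivity=1.0, rng=None):
    """Sample an index with probability proportional to
    exp(epsilon * score / (2 * sensitivity)).

    Assumption: each score changes by at most `sensitivity` when one
    private record changes; the privacy guarantee depends on that bound.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    rng = rng or random.Random()
    scale = epsilon / (2.0 * sensitivity)
    # Subtract the maximum score before exponentiating for numerical stability;
    # this rescales all weights equally and leaves the distribution unchanged.
    top = max(scores)
    weights = [math.exp(scale * (s - top)) for s in scores]
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]


# Hypothetical usage: three candidate notes with made-up quality scores in [0, 1].
candidates = ["candidate note A", "candidate note B", "candidate note C"]
quality = [0.1, 0.9, 0.5]
chosen = candidates[exponential_mechanism_select(quality, epsilon=1.0)]
```

A smaller ε flattens the sampling distribution, giving stronger privacy but weaker preference for high-scoring candidates; a larger ε concentrates probability on the best-scoring candidate.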
Problem

Research questions and friction points this paper is trying to address.

Synthesizing clinical notes with differential privacy
Balancing privacy and utility in medical text generation
Generating realistic synthetic healthcare data securely
Innovation

Methods, ideas, or system contributions that make the work stand out.

DP synthetic clinical notes generation
Separates content and form structurally
DP quality maximiser enhances output
Yuping Wu
University of Manchester
Viktor Schlegel
Deputy Director IN-CYPHER Programme @ IGS, Imperial College London
Natural Language Understanding, AI for Healthcare, Clinical NLP, AI Evaluation
Warren Del-Pinto
University of Manchester
Srinivasan Nandakumar
Imperial College London, Imperial Global Singapore
Iqra Zahid
Imperial College London, Imperial Global Singapore
Yidan Sun
Imperial College London, Imperial Global Singapore
Usama Farghaly Omar
Khoo Teck Puat Hospital, Singapore
Amirah Jasmine
Imperial College London
Arun-Kumar Kaliya-Perumal
Nanyang Technological University, Singapore
Musculoskeletal Health, Orthopaedic Spine Surgery, Disease Modeling, Genetics
Chun Shen Tham
University of Manchester
Gabriel Connors
University of Manchester
Anil A Bharath
Imperial College London
Goran Nenadic
Department of Computer Science, University of Manchester
Natural language processing, text mining, health informatics