Systematic Evaluation of the Quality of Synthetic Clinical Notes Rephrased by LLMs at Million-Note Scale

📅 2026-05-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the lack of systematic, multidimensional evaluation of clinical text generated by large language models, particularly the trade-off between clinical fact preservation and task utility. For the first time at million-note scale, it assesses synthetic clinical notes rephrased from the MIMIC databases along three dimensions in parallel: intrinsic quality, extrinsic utility, and factual consistency. The authors propose a chunk-wise rephrasing strategy to mitigate detail loss and combine automatic similarity metrics, downstream task benchmarks, and a hybrid fact-checking approach. Results show that synthetic notes perform well on coarse-grained tasks but degrade on fine-grained tasks such as ICD coding. Chunk-wise rephrasing substantially improves detail retention, at the cost of reduced factual precision when a chunk lacks full context. The synthetic notes can also augment training data for rare ICD codes.

📝 Abstract

Large language models (LLMs) can generate or synthesize clinical text for a wide range of applications, from improving clinical documentation to augmenting clinical text analytics. Yet evaluations typically focus on a narrow aspect -- such as similarity or utility comparisons -- even though these aspects are complementary and best viewed in parallel. In this study, we aim to conduct a systematic evaluation of LLM-generated clinical text, which includes intrinsic, extrinsic, and factuality evaluations of synthetic clinical notes rephrased from MIMIC databases at million-note scale. Our analysis demonstrates that synthetic notes preserve core clinical information and predictive utility for coarse-grained tasks despite substantial linguistic changes, but lose fine-grained details for tasks like ICD coding. We show this loss of detail can be substantially mitigated by rephrasing notes by chunks rather than by the whole note, but at the cost of reduced factual precision under incomplete context. Through fact-checking and error analysis, we further find that synthesis errors are dominated by misinterpretation of clinical context, alongside temporal confusion, measurement errors, and fabricated claims. Finally, we show that the synthetic notes -- despite their task-agnostic nature -- can effectively augment task-specific training for rare ICD codes.
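
The chunk-wise rephrasing idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the paragraph-based splitting rule, the `max_words` budget, and the function names `split_into_chunks` and `rephrase_by_chunks` are all assumptions made for illustration, and `rephrase` stands in for whatever LLM call would do the actual rewriting.

```python
from typing import Callable, List


def split_into_chunks(note: str, max_words: int = 200) -> List[str]:
    """Greedily pack blank-line-separated paragraphs into chunks.

    Assumption: paragraphs are the splitting unit. A single paragraph
    longer than max_words is kept whole as its own oversized chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for para in (p.strip() for p in note.split("\n\n")):
        if not para:
            continue
        n = len(para.split())
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def rephrase_by_chunks(
    note: str,
    rephrase: Callable[[str], str],  # hypothetical stand-in for an LLM call
    max_words: int = 200,
) -> str:
    """Rephrase each chunk independently and join the results in order."""
    return "\n\n".join(rephrase(c) for c in split_into_chunks(note, max_words))
```

Because each chunk is rephrased without seeing the rest of the note, a sketch like this also makes the reported trade-off concrete: shorter inputs leave less room to drop details, but the rewriter may misread a chunk whose context lives elsewhere in the note.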

Problem

Research questions and friction points this paper is trying to address.

synthetic clinical notes

large language models

clinical text quality

factuality evaluation

ICD coding

Innovation

Methods, ideas, or system contributions that make the work stand out.

systematic evaluation

synthetic clinical notes

chunk-wise rephrasing