🤖 AI Summary
In privacy-sensitive domains such as clinical NLP, scarce access to real-world annotated data hinders the development of robust de-identification systems. Method: We propose a synthetic data construction paradigm: domain-adapted large language models (LLMs) generate clinical text, and encoder-based NER models (e.g., BERT-CRF) automatically annotate PII entities, yielding synthetic corpora for training de-identification NER models. Contribution/Results: We find that machine annotation quality, not synthetic data scale, determines the upper bound of downstream NER performance, and that only minimal real data is required for effective LLM domain adaptation. Cross-lingual ablation studies (Swedish/Spanish) show that NER models trained on synthetic data achieve performance close to real-data baselines, with annotation quality being the dominant factor. This work provides an efficient, controllable, and reproducible data substitution framework for privacy-constrained settings.
📝 Abstract
Many sensitive domains -- such as the clinical domain -- lack widely available datasets due to privacy risks. The increasing generative capabilities of large language models (LLMs) have made synthetic datasets a viable path forward. In this study, we domain-adapt LLMs to the clinical domain and generate synthetic clinical texts that are machine-annotated with tags for personally identifiable information (PII) using capable encoder-based NER models. The synthetic corpora are then used to train downstream NER models. The results show that training NER models on synthetic corpora incurs only a small drop in predictive performance. The limits of this process are investigated in a systematic ablation study -- using both Swedish and Spanish data. Our analysis shows that smaller datasets can be sufficient for domain-adapting LLMs for data synthesis. Instead, the effectiveness of the process is almost entirely contingent on the performance of the machine-annotating NER models trained on the original data.
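The pipeline described above has three stages: an LLM generates synthetic clinical text, an encoder-based NER model machine-annotates PII spans, and the resulting corpus trains a downstream de-identification model. The sketch below illustrates that data flow only; `generate_text`, `annotate_pii`, and `train_ner` are hypothetical stand-ins (stubbed with toy logic), not the paper's actual models or code.

```python
# Hypothetical sketch of the three-stage synthetic-data pipeline.
# All three functions are illustrative stubs, not the paper's implementation.

def generate_text(prompt: str, n: int) -> list[str]:
    # Stage 1: a domain-adapted LLM would generate synthetic clinical notes.
    # Stubbed with templated strings for illustration.
    return [f"{prompt} note {i}: Patient Anna visited the clinic." for i in range(n)]

def annotate_pii(texts: list[str]) -> list[dict]:
    # Stage 2: an encoder-based NER model (e.g., BERT-CRF) would tag PII spans.
    # Stubbed: tag the literal token "Anna" as a NAME entity.
    corpus = []
    for text in texts:
        entities = []
        start = text.find("Anna")
        if start != -1:
            entities.append({"start": start, "end": start + len("Anna"), "label": "NAME"})
        corpus.append({"text": text, "entities": entities})
    return corpus

def train_ner(corpus: list[dict]) -> dict:
    # Stage 3: the machine-annotated synthetic corpus trains a new NER model.
    # Stubbed: return a summary of the training set instead of a real model.
    n_entities = sum(len(example["entities"]) for example in corpus)
    return {"examples": len(corpus), "entities": n_entities}

# Wire the stages together: generate -> machine-annotate -> train.
synthetic_texts = generate_text("Synthetic clinical", 3)
annotated_corpus = annotate_pii(synthetic_texts)
model_summary = train_ner(annotated_corpus)
```

As the ablation results suggest, the quality of stage 2 (the machine-annotating NER model) bounds what stage 3 can learn, regardless of how much text stage 1 produces.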