🤖 AI Summary
Clinical NLP faces challenges including scarcity of real-world data, stringent privacy requirements, and the complexity of medical terminology. Method: We propose ClinGen, a resource-efficient knowledge-injection framework for synthetic clinical text generation. ClinGen introduces a novel prompting mechanism that synergistically integrates external biomedical knowledge graphs with large language models (LLMs) to jointly model clinical topics and writing styles, ensuring privacy preservation and regulatory compliance while enhancing fidelity and lexical/semantic diversity. The approach comprises knowledge extraction, knowledge graph embedding, and context-aware prompt engineering, complemented by a multi-task evaluation framework. Results: Extensive experiments across seven clinical NLP tasks and sixteen benchmark datasets demonstrate significant performance gains. Generated data exhibit improved distributional alignment with real clinical corpora, and training sample diversity increases by an average of 42%.
📝 Abstract
Clinical natural language processing requires methods that can address domain-specific challenges, such as complex medical terminology and clinical contexts. Recently, large language models (LLMs) have shown promise in this domain. Yet, their direct deployment can lead to privacy issues and are constrained by resources. To address this challenge, we delve into synthetic clinical text generation using LLMs for clinical NLP tasks. We propose an innovative, resource-efficient approach, ClinGen, which infuses knowledge into the process. Our model involves clinical knowledge extraction and context-informed LLM prompting. Both clinical topics and writing styles are drawn from external domain-specific knowledge graphs and LLMs to guide data generation. Our extensive empirical study across 7 clinical NLP tasks and 16 datasets reveals that ClinGen consistently enhances performance across various tasks, effectively aligning the distribution of real datasets and significantly enriching the diversity of generated training instances. Our code is available at url{https://github.com/ritaranx/ClinGen}.