🤖 AI Summary
This study addresses the scarcity of Russian-language clinical coding resources. It introduces a new Russian ICD coding dataset built from EHR diagnosis fields, annotated with over 10,000 diagnosis entities and more than 1,500 unique ICD-10 codes, and uses it to benchmark automated ICD coding for Russian clinical text. The evaluated approaches include BERT, LLaMA fine-tuned with LoRA, and retrieval-augmented generation (RAG), together with transfer learning across domains (PubMed abstracts → clinical diagnoses) and terminologies (UMLS concepts → ICD codes). The best-performing model is then used to label an in-house EHR dataset of patient histories from 2017 to 2021. On a carefully curated test set, training on these automatically predicted codes gives significantly higher accuracy than training on the physicians' manual annotations. The work provides a benchmark dataset and a practical route toward automated clinical coding in low-resource languages.
📝 Abstract
This study investigates the feasibility of automating clinical coding in Russian, a language with limited biomedical resources. We present a new dataset for ICD coding, consisting of diagnosis fields from electronic health records (EHRs) annotated with over 10,000 entities and more than 1,500 unique ICD codes. We use this dataset to benchmark several state-of-the-art approaches, including BERT, LLaMA with LoRA, and retrieval-augmented generation (RAG), and run additional experiments on transfer learning across domains (from PubMed abstracts to medical diagnoses) and terminologies (from UMLS concepts to ICD codes). We then apply the best-performing model to label an in-house EHR dataset containing patient histories from 2017 to 2021. Our experiments on a carefully curated test set show that training on the automatically predicted codes yields significantly higher accuracy than training on the codes manually assigned by physicians. We believe these findings offer valuable insight into the potential for automating clinical coding in resource-limited languages such as Russian, which could improve clinical efficiency and data accuracy in these settings.
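To make the task concrete, the following is a minimal sketch of the retrieval step that a retrieval-augmented ICD coder might use: given a free-text diagnosis, rank candidate ICD codes by similarity to their official titles. It is an illustration only and not the paper's method. It assumes character-trigram cosine similarity in place of a neural encoder, and the names `ICD_TITLES`, `char_ngrams`, `cosine`, and `retrieve_codes` are hypothetical. The four-entry dictionary uses standard Russian ICD-10 titles and is not drawn from the dataset described above.

```python
import math
from collections import Counter

# Tiny illustrative dictionary (ICD-10 code -> Russian title).
# Assumption: a real system would index the full ICD-10 terminology.
ICD_TITLES = {
    "I10": "Эссенциальная [первичная] гипертензия",
    "I21": "Острый инфаркт миокарда",
    "J45": "Астма",
    "E11": "Инсулиннезависимый сахарный диабет",
}


def char_ngrams(text, n=3):
    """Count character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_codes(diagnosis, k=2):
    """Return the k ICD codes whose titles are most similar to the diagnosis."""
    query = char_ngrams(diagnosis)
    scored = [
        (cosine(query, char_ngrams(title)), code)
        for code, title in ICD_TITLES.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [code for _, code in scored[:k]]


# Usage: a diagnosis string as it might appear in an EHR diagnosis field.
top = retrieve_codes("острый инфаркт миокарда передней стенки", k=1)
```

In a retrieval-augmented setup, the retrieved candidates would typically be passed to a generative model as context so that it selects among plausible codes rather than generating one unaided. Character n-grams are used here only because they tolerate Russian inflectional endings without requiring a trained model.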