RuCCoD: Towards Automated ICD Coding in Russian

📅 2025-02-28

🤖 AI Summary

This study addresses the scarcity of Russian-language clinical coding resources. It introduces a new Russian ICD coding dataset built from the diagnosis fields of electronic health records (EHRs), annotated with over 10,000 entities and more than 1,500 unique ICD codes, and uses it to benchmark BERT, LoRA-fine-tuned LLaMA, and RAG. Additional experiments examine transfer learning across domains (PubMed abstracts → clinical diagnoses) and across terminologies (UMLS concepts → ICD codes). The best-performing model is then used to label an in-house EHR dataset of patient histories from 2017–2021. On a carefully curated test set, training on these automatically predicted codes yields significantly higher accuracy than training on codes assigned manually by physicians. The results point to a practical route toward automated clinical coding in resource-limited languages such as Russian.

📝 Abstract

This study investigates the feasibility of automating clinical coding in Russian, a language with limited biomedical resources. We present a new dataset for ICD coding, which includes diagnosis fields from electronic health records (EHRs) annotated with over 10,000 entities and more than 1,500 unique ICD codes. This dataset serves as a benchmark for several state-of-the-art models, including BERT, LLaMA with LoRA, and RAG, with additional experiments examining transfer learning across domains (from PubMed abstracts to medical diagnoses) and terminologies (from UMLS concepts to ICD codes). We then apply the best-performing model to label an in-house EHR dataset containing patient histories from 2017 to 2021. Our experiments, conducted on a carefully curated test set, demonstrate that training on the automatically predicted codes leads to a significant improvement in accuracy compared to training on data manually annotated by physicians. We believe our findings offer valuable insights into the potential for automating clinical coding in resource-limited languages like Russian, which could enhance clinical efficiency and data accuracy in these contexts.

Problem

Research questions and friction points this paper is trying to address.

Automating ICD coding in Russian with limited biomedical resources.

Creating a dataset for ICD coding from Russian EHRs.

Improving accuracy of clinical coding using automated models.

Innovation

Methods, ideas, or system contributions that make the work stand out.

New dataset for ICD coding in Russian

Benchmarking BERT, LLaMA with LoRA, and RAG

Transfer learning across domains (PubMed abstracts to medical diagnoses) and terminologies (UMLS concepts to ICD codes)
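
To make the retrieval idea behind RAG-style coding concrete, here is a minimal sketch of the candidate-lookup step: a diagnosis mention is matched against ICD code descriptions and the closest codes are returned. This is not the paper's pipeline. The four-entry `ICD10` dictionary, the character-trigram Dice similarity, and the function names `char_ngrams`, `dice`, and `retrieve` are all assumptions chosen for illustration. A real system would most likely use a learned encoder and the full ICD terminology.

```python
from collections import Counter

# Toy subset of ICD-10 with Russian descriptions (illustrative only;
# this is NOT the label set or the retriever used in the paper).
ICD10 = {
    "I10": "Эссенциальная (первичная) гипертензия",
    "I21": "Острый инфаркт миокарда",
    "J45": "Астма",
    "K29": "Гастрит и дуоденит",
}


def char_ngrams(text, n=3):
    """Return a multiset of character n-grams of the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def dice(a, b):
    """Dice coefficient between two n-gram multisets (0.0 if both are empty)."""
    overlap = sum((a & b).values())
    total = sum(a.values()) + sum(b.values())
    return 2 * overlap / total if total else 0.0


def retrieve(mention, k=2):
    """Return the k ICD codes whose descriptions are most similar to the mention."""
    query = char_ngrams(mention)
    scored = [(dice(query, char_ngrams(name)), code) for code, name in ICD10.items()]
    # Highest similarity first; ties are broken alphabetically by code.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(code, score) for score, code in scored[:k]]
```

Calling `retrieve("бронхиальная астма")` ranks `J45` first in this toy dictionary. In a RAG setup, the retrieved candidates would then be passed to a generator (for example, a fine-tuned LLM) that selects the final code.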
