From Documents to Spans: Code-Centric Learning for LLM-based ICD Coding

📅 2026-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses key challenges in applying large language models (LLMs) to ICD coding—namely, insufficient training data coverage, poor interpretability, and high computational costs associated with long clinical documents. To overcome these limitations, the authors propose a code-centric learning framework that shifts supervision from full clinical notes to short, evidence-based text snippets, enabling snippet-level training to enhance document-level coding performance. By incorporating code-centric data augmentation and a hybrid fine-tuning strategy, the method significantly reduces training overhead while improving generalization to unseen ICD codes and preserving decision interpretability. Experimental results demonstrate that, using the same LLM backbone, the proposed approach substantially outperforms strong baselines, enabling small open-source models to achieve coding accuracy comparable to that of large proprietary models.

📝 Abstract
ICD coding is a critical yet challenging task in healthcare. Recently, LLM-based methods have demonstrated stronger generalization than discriminative methods in ICD coding. However, fine-tuning LLMs for ICD coding faces three major challenges. First, existing public ICD coding datasets provide limited coverage of the ICD code space, restricting a model's ability to generalize to unseen codes. Second, naive fine-tuning diminishes the interpretability of LLMs, as few public datasets contain explicit supporting evidence for assigned codes. Third, ICD coding typically involves long clinical documents, making fine-tuning LLMs computationally expensive. To address these issues, we propose Code-Centric Learning, a training framework that shifts supervision from full clinical documents to scalable, short evidence spans. The key idea of this framework is that span-level learning improves LLMs' ability to perform document-level ICD coding. Our proposed framework consists of a mixed training strategy and code-centric data expansion, which substantially reduces training cost, improves accuracy on unseen ICD codes, and preserves interpretability. Under the same LLM backbone, our method substantially outperforms strong baselines. Notably, our method enables small-scale LLMs to achieve performance comparable to much larger proprietary models, demonstrating its effectiveness and potential for fully automated ICD coding.
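As a rough illustration of the abstract's core idea, shifting supervision from full documents to short evidence spans amounts to turning each (code, evidence span) pair into its own short training example. The sketch below is hypothetical: the record layout, field names, and prompt format are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of span-level supervision for ICD coding.
# Instead of pairing a long clinical note with all of its codes,
# each (code, evidence span) pair becomes a short training example,
# which is far cheaper to fine-tune on than the full document.

def build_span_examples(records):
    """records: iterable of dicts with 'note' (full text, unused here)
    and 'codes' (mapping from ICD code to its evidence span)."""
    examples = []
    for rec in records:
        for code, span in rec["codes"].items():
            examples.append({
                "input": f"Evidence: {span}\nWhich ICD code does this support?",
                "target": code,
            })
    return examples

demo = [{
    "note": "Patient admitted with chest pain ... (long document) ...",
    "codes": {"I21.9": "acute myocardial infarction, unspecified"},
}]
print(build_span_examples(demo)[0]["target"])  # → I21.9
```

At inference time, document-level coding would still run over the full note; the point of this sketch is only that the training signal can come from short spans.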
Problem

Research questions and friction points this paper is trying to address.

ICD coding
large language models
fine-tuning
interpretability
data sparsity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Code-Centric Learning
evidence spans
ICD coding
large language models
data augmentation
Xu Zhang
School of Biomedical Engineering, Division of Life Sciences and Medicine, USTC; MIRACLE Center, Suzhou Institute for Advanced Research, USTC
Wenxin Ma
University of Science and Technology of China
AI, computer vision
Chenxu Wu
USTC
diffusion-based methods, multimodal learning
Rongsheng Wang
The Chinese University of Hong Kong, Shenzhen
Deep Learning
Kun Zhang
School of Biomedical Engineering, Division of Life Sciences and Medicine, USTC; MIRACLE Center, Suzhou Institute for Advanced Research, USTC
S. Kevin Zhou
School of Biomedical Engineering, Division of Life Sciences and Medicine, USTC; MIRACLE Center, Suzhou Institute for Advanced Research, USTC; Jiangsu Provincial Key Laboratory of Multimodal Digital Twin Technology; State Key Laboratory of Precision and Intelligent Chemistry, USTC