Toward Reliable Clinical Coding with Language Models: Verification and Lightweight Adaptation

📅 2025-10-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current general-purpose large language models (LLMs) suffer from hierarchical misalignment in clinical coding tasks: they generate semantically proximal yet incorrect ICD codes. Moreover, mainstream benchmarks (e.g., MIMIC) exhibit critical limitations, including insufficient supporting evidence in notes and a strong inpatient bias, which undermine model reliability and generalizability. To address these issues, we propose a trustworthy clinical coding framework comprising three core components: (1) formalizing code-level hierarchical validation as a novel auxiliary task; (2) constructing the first expert-annotated, multi-department outpatient benchmark dataset with dual annotations per case; and (3) integrating prompt engineering, lightweight fine-tuning, and an ICD hierarchy-aware verification module for end-to-end error correction. Experiments demonstrate substantial reductions in near-miss errors, significant gains in coding accuracy (+4.2% macro-F1), and improved robustness across diverse clinical scenarios. Our framework establishes a high-fidelity, clinically grounded solution for automated ICD coding.

📝 Abstract
Accurate clinical coding is essential for healthcare documentation, billing, and decision-making. While prior work shows that off-the-shelf LLMs struggle with this task, evaluations based on exact-match metrics often overlook errors where predicted codes are hierarchically close but incorrect. Our analysis reveals that such hierarchical misalignments account for a substantial portion of LLM failures. We show that lightweight interventions, including prompt engineering and small-scale fine-tuning, can improve accuracy without the computational overhead of search-based methods. To address hierarchical near-miss errors, we introduce clinical code verification as both a standalone task and a pipeline component. To mitigate limitations of existing datasets, such as incomplete evidence and inpatient bias in MIMIC, we release an expert double-annotated benchmark of outpatient clinical notes with ICD-10 codes. Our results highlight verification as an effective and reliable step toward improving LLM-based medical coding.
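To make the notion of a hierarchical near-miss concrete, the sketch below shows one plausible, prefix-based reading of the ICD-10 hierarchy: a prediction that is wrong under exact match but shares the three-character category with the gold code (e.g., E11.42 vs. E11.9) counts as a near-miss rather than an outright error. The function names and the exact hierarchy rule are illustrative assumptions, not the paper's implementation.

```python
def icd10_ancestors(code: str) -> list[str]:
    """Return the hierarchical ancestors of an ICD-10 code.

    ICD-10 codes nest by prefix once the dot is removed,
    e.g. E11.42 -> E11.4 -> E11 (the 3-character category root).
    """
    base = code.replace(".", "")
    return [base[:i] for i in range(3, len(base))]

def is_near_miss(predicted: str, gold: str) -> bool:
    """A prediction is a hierarchical near-miss if it fails exact
    match but falls in the same 3-character category as the gold code."""
    p, g = predicted.replace(".", ""), gold.replace(".", "")
    return p != g and p[:3] == g[:3]

# E11.42 (type 2 diabetes with polyneuropathy) predicted for gold
# E11.9 (type 2 diabetes without complications): same category E11.
print(is_near_miss("E11.42", "E11.9"))  # True: near-miss, not a gross error
print(is_near_miss("E10.9", "E11.9"))   # False: different category
```

Under exact-match scoring both predictions above count identically as failures; a hierarchy-aware verification step can distinguish them and target the recoverable near-misses for correction.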
Problem

Research questions and friction points this paper is trying to address.

Addressing hierarchical misalignments in clinical coding
Improving accuracy through lightweight interventions and verification
Mitigating dataset limitations with expert-annotated outpatient benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight fine-tuning improves coding accuracy
Clinical code verification reduces hierarchical errors
Expert-annotated outpatient benchmark addresses dataset limitations