Evaluating Hierarchical Clinical Document Classification Using Reasoning-Based LLMs

📅 2025-07-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Accurate ICD-10 coding of hospital discharge summaries is clinically critical yet error-prone due to its hierarchical, fine-grained nature. Method: We evaluated 11 large language models (LLMs), including Gemini 2.5 Pro, on high-frequency ICD-10 codes using a standardized coder-style prompting template; clinical entities were pre-extracted via cTAKES to enhance input fidelity. Hierarchical classification was performed to reflect real-world coding constraints. Contribution/Results: We systematically compared LLMs with structured reasoning capabilities against non-reasoning counterparts. All models achieved ≤57% macro-F1, with performance inversely correlated with code specificity. Reasoning-capable models consistently outperformed non-reasoning ones, with Gemini 2.5 Pro achieving the highest F1. These findings indicate that current LLMs can serve as efficiency-enhancing decision-support tools in clinical coding but remain insufficient to replace human coders, given persistent accuracy limitations in complex hierarchical classification.
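The reported metric, macro-F1 over a fixed set of ICD-10 codes, can be sketched as follows. This is a minimal illustration of the standard multi-label macro-F1 computation, not the paper's evaluation code; the codes and predictions below are hypothetical.

```python
def macro_f1(gold, pred, labels):
    """Macro-averaged F1 over a fixed label set.

    gold, pred: lists of sets of assigned codes, one set per document.
    labels: the full code list; each label contributes equally to the mean.
    """
    f1s = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1s.append(f1)
    return sum(f1s) / len(f1s)

# Hypothetical example: two discharge summaries, two candidate codes.
codes = ["I25.10", "E11.9"]
gold = [{"I25.10"}, {"I25.10", "E11.9"}]
pred = [{"I25.10"}, {"E11.9"}]
print(macro_f1(gold, pred, codes))  # → 0.8333333333333333
```

Because macro-F1 weights every code equally, rare or highly specific codes drag the average down, which is consistent with the inverse correlation between performance and code specificity noted above.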

📝 Abstract
This study evaluates how well large language models (LLMs) can classify ICD-10 codes from hospital discharge summaries, a critical but error-prone task in healthcare. Using 1,500 summaries from the MIMIC-IV dataset and focusing on the 10 most frequent ICD-10 codes, the study tested 11 LLMs, including models with and without structured reasoning capabilities. Medical terms were extracted using a clinical NLP tool (cTAKES), and models were prompted in a consistent, coder-like format. None of the models achieved an F1 score above 57%, with performance dropping as code specificity increased. Reasoning-based models generally outperformed non-reasoning ones, with Gemini 2.5 Pro performing best overall. Some codes, such as those related to chronic heart disease, were classified more accurately than others. The findings suggest that while LLMs can assist human coders, they are not yet reliable enough for full automation. Future work should explore hybrid methods, domain-specific model training, and the use of structured clinical data.
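A consistent, coder-like prompt of the kind the abstract describes could be assembled along these lines. This is a hypothetical sketch: the paper's actual template and the exact format of the cTAKES entity output are not reproduced here.

```python
def build_prompt(entities, candidate_codes):
    """Assemble a coder-style prompt from pre-extracted clinical
    entities (e.g. cTAKES output) and a fixed candidate code list."""
    lines = [
        "You are a clinical coder. Assign all applicable ICD-10 codes",
        "to the discharge summary described by the entities below.",
        "Candidate codes: " + ", ".join(candidate_codes),
        "Extracted clinical entities:",
    ]
    lines += [f"- {e}" for e in entities]
    lines.append("Answer with a comma-separated list of codes.")
    return "\n".join(lines)

prompt = build_prompt(
    ["type 2 diabetes mellitus", "essential hypertension"],
    ["E11.9", "I10"],
)
print(prompt)
```

Using one fixed template for all 11 models keeps the comparison fair: differences in F1 then reflect the models, not prompt variation.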
Problem

Research questions and friction points this paper is trying to address.

Evaluate LLMs for ICD-10 code classification accuracy
Compare reasoning vs non-reasoning LLMs in clinical coding
Assess limitations of LLMs in automating medical coding tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Used reasoning-based LLMs for ICD-10 classification
Employed clinical NLP tool for term extraction
Proposed hybrid methods and domain-specific training as future work