Evaluation and LLM-Guided Learning of ICD Coding Rationales

📅 2025-08-22

🤖 AI Summary

Problem: Deep learning models for automated clinical coding (e.g., ICD coding) are hard to interpret, which undermines clinical trust. Existing explainability studies rely heavily on attention mechanisms and lack both high-quality rationale datasets and methods trained specifically to generate rationales. Method: The paper contributes (1) a rationale-annotated dataset for ICD coding with denser annotations at diverse granularity; (2) an evaluation along two axes, *faithfulness* (how well an explanation reflects the model's actual reasoning) and *plausibility* (how consistent it is with human expert judgment), applied to three types of rationales; and (3) rationale learning methods that use LLM-generated rationales, prompted with or without annotated examples, as distant supervision signals. Results: LLM-generated rationales align most closely with those of human experts, and adding few-shot human-annotated examples further improves both rationale generation and the rationale-learning approaches.

📝 Abstract

Automated clinical coding involves mapping unstructured text from Electronic Health Records (EHRs) to standardized code systems such as the International Classification of Diseases (ICD). While recent advances in deep learning have significantly improved the accuracy and efficiency of ICD coding, the lack of explainability in these models remains a major limitation, undermining trust and transparency. Current work on explainability largely relies on attention-based techniques and qualitative assessments by physicians, yet it lacks systematic evaluation using consistent criteria on high-quality rationale datasets, as well as dedicated approaches explicitly trained to generate rationales. In this work, we conduct a comprehensive evaluation of the explainability of rationales for ICD coding through two key lenses: faithfulness, which evaluates how well explanations reflect the model's actual reasoning, and plausibility, which measures how consistent the explanations are with human expert judgment. To facilitate the evaluation of plausibility, we construct a new rationale-annotated dataset that offers denser annotations with diverse granularity and aligns better with current clinical practice, and we evaluate three types of rationales for ICD coding. Encouraged by the promising plausibility of LLM-generated rationales for ICD coding, we further propose new rationale learning methods to improve the quality of model-generated rationales, in which rationales produced by prompting LLMs with or without annotation examples are used as distant supervision signals. We empirically find that LLM-generated rationales align most closely with those of human experts. Moreover, incorporating few-shot human-annotated examples not only further improves rationale generation but also enhances rationale-learning approaches.
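
To make the two evaluation lenses concrete, here is a minimal Python sketch. The abstract does not say which metrics the paper uses, so the choices below are assumptions: plausibility is approximated by token-level F1 against an expert rationale, and faithfulness by the comprehensiveness and sufficiency scores that are common in rationale evaluation. The function names and the toy `toy_predict` model are hypothetical and exist only for illustration.

```python
from typing import Callable, List, Sequence, Set

# A model is anything that maps a token list to the probability of one ICD code.
Predictor = Callable[[List[str]], float]


def token_f1(predicted: Set[int], gold: Set[int]) -> float:
    """Plausibility proxy: token-level F1 between model and expert rationales."""
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def comprehensiveness(predict: Predictor, tokens: Sequence[str], rationale: Set[int]) -> float:
    """Faithfulness proxy: probability drop after deleting the rationale tokens."""
    without = [t for i, t in enumerate(tokens) if i not in rationale]
    return predict(list(tokens)) - predict(without)


def sufficiency(predict: Predictor, tokens: Sequence[str], rationale: Set[int]) -> float:
    """Faithfulness proxy: probability drop when only rationale tokens are kept."""
    only = [t for i, t in enumerate(tokens) if i in rationale]
    return predict(list(tokens)) - predict(only)


def toy_predict(tokens: List[str]) -> float:
    """Hypothetical stand-in for an ICD coder: keys on a single word."""
    return 0.9 if "pneumonia" in tokens else 0.1


note = ["patient", "admitted", "with", "pneumonia", "and", "fever"]
model_rationale = {3}      # token indices the model highlights
expert_rationale = {3, 5}  # token indices an annotator marked

print(token_f1(model_rationale, expert_rationale))
print(comprehensiveness(toy_predict, note, model_rationale))
print(sufficiency(toy_predict, note, model_rationale))
```

Under these assumed definitions, a faithful rationale has high comprehensiveness (removing it hurts the prediction) and low sufficiency (keeping only it barely changes the prediction), while a plausible rationale has high overlap with the expert annotation.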

Problem

Research questions and friction points this paper is trying to address.

Evaluating explainability of ICD coding models using faithfulness and plausibility metrics

Creating a high-quality rationale dataset with diverse granularity annotations

Developing LLM-guided learning methods to improve rationale generation quality

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-generated rationales for distant supervision

Comprehensive evaluation using faithfulness and plausibility

Few-shot human examples enhance rationale learning
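
One way to picture distant supervision from LLM-generated rationales is a joint objective that adds a rationale term to the usual code-prediction loss. The sketch below is an assumption about the general shape of such an objective, not the paper's actual method: the per-token importance scores, the 0/1 mask derived from an LLM rationale, the `weight` parameter, and all function names are hypothetical.

```python
import math
from typing import Sequence


def binary_cross_entropy(prob: float, label: int, eps: float = 1e-12) -> float:
    """BCE for one probability/label pair, clamped to avoid log(0)."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def rationale_loss(token_scores: Sequence[float], llm_mask: Sequence[int]) -> float:
    """Mean BCE between the model's token scores and the LLM-derived mask."""
    if not token_scores or len(token_scores) != len(llm_mask):
        raise ValueError("token_scores and llm_mask must be non-empty and equal length")
    total = sum(binary_cross_entropy(s, m) for s, m in zip(token_scores, llm_mask))
    return total / len(token_scores)


def joint_loss(
    code_prob: float,
    code_label: int,
    token_scores: Sequence[float],
    llm_mask: Sequence[int],
    weight: float = 0.5,  # hypothetical trade-off between the two terms
) -> float:
    """Code-prediction loss plus a weighted rationale-supervision term."""
    return binary_cross_entropy(code_prob, code_label) + weight * rationale_loss(
        token_scores, llm_mask
    )


# The mask marks tokens that an LLM-generated rationale selected (assumed input).
llm_mask = [0, 0, 0, 1, 0, 1]
aligned = [0.1, 0.1, 0.1, 0.9, 0.1, 0.9]     # model scores agree with the mask
misaligned = [0.9, 0.9, 0.1, 0.1, 0.1, 0.1]  # model scores disagree with the mask

print(joint_loss(0.8, 1, aligned, llm_mask))
print(joint_loss(0.8, 1, misaligned, llm_mask))
```

In this sketch the mask is a noisy label rather than ground truth, which is why the rationale term is down-weighted; few-shot human-annotated examples would enter earlier, in the prompt that produces the mask.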
