🤖 AI Summary
Existing automatic ICD coding systems lack interpretability in long-text, multi-label clinical scenarios, and current evaluation methods fail to validate evidence–code consistency. Method: Leveraging the MDACE dataset, we propose a novel evidence matching metric that quantifies semantic overlap between model-extracted textual evidence and ground-truth code descriptions, combining text matching and evidence alignment techniques to systematically assess the effectiveness and biases of mainstream interpretability methods in evidence extraction. Contribution/Results: Our evaluation highlights both success and failure cases, revealing that while current methods capture part of the ground-truth evidence, overall evidence–code consistency remains limited. The proposed metric improves the objectivity and clinical plausibility of interpretability assessment. This work establishes a reproducible, clinically grounded evaluation framework, supporting the development, diagnostic traceability, and trustworthy deployment of interpretable medical coding systems.
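To make the idea of an evidence-matching metric concrete, here is a minimal sketch of one plausible measure: token-level F1 overlap between a model-extracted evidence span and an ICD code description. This is an illustrative assumption, not the paper's exact metric (the `token_f1` function, the example texts, and the ICD-10 code I10 are all hypothetical choices for demonstration).

```python
from collections import Counter

def token_f1(evidence: str, code_description: str) -> float:
    """Token-level F1 overlap between an extracted evidence span and a
    code description. Illustrative sketch only, not the paper's metric."""
    ev = Counter(evidence.lower().split())
    desc = Counter(code_description.lower().split())
    overlap = sum((ev & desc).values())  # multiset intersection of tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(ev.values())
    recall = overlap / sum(desc.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: evidence span vs. the ICD-10 description for I10
evidence = "patient has a history of essential hypertension"
description = "essential primary hypertension"
score = token_f1(evidence, description)  # partial overlap, 0 < score < 1
```

A real evaluation would likely go beyond surface tokens (e.g., embedding-based semantic similarity), but a simple overlap measure like this already exposes both matches and mismatches between evidence and code descriptions.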
📝 Abstract
Automatic medical coding has the potential to ease documentation and billing processes. For this task, transparency plays an important role for medical coders and regulatory bodies, which can be achieved using explainability methods. However, the evaluation of these approaches has been mostly limited to short text and binary settings due to a scarcity of annotated data. Recent efforts by Cheng et al. (2023) have introduced the MDACE dataset, which provides a valuable resource containing code evidence in clinical records. In this work, we conduct an in-depth analysis of the MDACE dataset and perform plausibility evaluation of current explainable medical coding systems from an applied perspective. With this, we contribute to a deeper understanding of automatic medical coding and evidence extraction. Our findings reveal that ground truth evidence aligns with code descriptions to a certain degree. An investigation into state-of-the-art approaches shows a high overlap with ground truth evidence. We propose match measures and highlight success and failure cases. Based on our findings, we provide recommendations for developing and evaluating explainable medical coding systems.