🤖 AI Summary
This work addresses a critical limitation in automated ICD coding of clinical notes: the neglect of diagnostic code ordering. As a first attempt, it formalizes the task as a joint classification and ranking problem. The authors propose an end-to-end learning-to-rank framework that combines a text encoder with a multi-label ranking loss, so that semantic representations and coding priority are learned together. The method reaches 47% top-1 ranking accuracy for principal diagnoses, 27 percentage points above the state-of-the-art classifier (20%), and attains micro-F1 = 0.6065 and macro-F1 = 0.2904, both above the previous best model (0.597 and 0.2660). The core contribution is a clinically grounded ranking mechanism that reflects the relative importance of diagnoses, moving ICD coding from static label assignment toward order-aware prediction.
📝 Abstract
Clinical notes contain unstructured text written by clinicians during patient encounters. These notes are usually accompanied by a sequence of diagnostic codes that follow the International Classification of Diseases (ICD). Correctly assigning and ordering ICD codes is essential for medical diagnosis and reimbursement, yet automating this task remains challenging. State-of-the-art methods treat the problem as a classification task and therefore ignore the order of the ICD codes, even though that order matters for several downstream purposes. In this work, as a first attempt, we approach the task from a retrieval-system perspective in order to account for code order, formulating the problem as a joint classification and ranking task. Our results and analysis show that the proposed framework is better at identifying high-priority codes than other methods. For instance, our model ranks the primary diagnosis code correctly in 47% of cases, compared to 20% for the state-of-the-art classifier. In terms of classification metrics, the proposed model achieves micro- and macro-F1 scores of 0.6065 and 0.2904, respectively, surpassing the previous best model's scores of 0.597 and 0.2660.
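The abstract reports two kinds of metrics: an order-sensitive one (how often the primary diagnosis is ranked first) and order-insensitive ones (micro- and macro-F1). The paper's exact evaluation protocol is not given here, so the following is only a minimal sketch of how such metrics could be computed. It assumes that the first code in each gold sequence is the primary diagnosis, that the model's output is a ranked list of codes, and that macro-F1 averages over the codes seen in either the gold or the predicted sets. The function names and the toy codes are hypothetical and not taken from the paper.

```python
from typing import Dict, Sequence, Tuple


def top1_primary_accuracy(ranked_preds: Sequence[Sequence[str]],
                          gold_sequences: Sequence[Sequence[str]]) -> float:
    """Share of notes whose top-ranked prediction equals the first gold code.

    Assumption: gold[0] is the primary diagnosis of the note.
    """
    if not gold_sequences:
        return 0.0
    hits = sum(
        1
        for pred, gold in zip(ranked_preds, gold_sequences)
        if pred and gold and pred[0] == gold[0]
    )
    return hits / len(gold_sequences)


def micro_macro_f1(ranked_preds: Sequence[Sequence[str]],
                   gold_sequences: Sequence[Sequence[str]]) -> Tuple[float, float]:
    """Order-insensitive micro- and macro-F1 over predicted vs. gold code sets."""
    tp: Dict[str, int] = {}
    fp: Dict[str, int] = {}
    fn: Dict[str, int] = {}
    for pred, gold in zip(ranked_preds, gold_sequences):
        pred_set, gold_set = set(pred), set(gold)
        for code in pred_set & gold_set:
            tp[code] = tp.get(code, 0) + 1
        for code in pred_set - gold_set:
            fp[code] = fp.get(code, 0) + 1
        for code in gold_set - pred_set:
            fn[code] = fn.get(code, 0) + 1

    def f1(t: int, p: int, n: int) -> float:
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    labels = set(tp) | set(fp) | set(fn)
    if not labels:
        return 0.0, 0.0
    # Micro-F1 pools the counts over all codes; macro-F1 averages per-code F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp.get(c, 0), fp.get(c, 0), fn.get(c, 0)) for c in labels) / len(labels)
    return micro, macro


# Toy data (illustrative only): each list is ordered, most important code first.
gold = [["I50.9", "E11.9", "I10"], ["J18.9", "I10"], ["N17.9", "E11.9"]]
pred = [["I50.9", "I10"], ["I10", "J18.9"], ["N17.9", "E11.9", "I10"]]

# Reversing each prediction keeps the code sets, and hence F1, unchanged,
# but changes which code is ranked first.
pred_reversed = [list(reversed(p)) for p in pred]

print(top1_primary_accuracy(pred, gold), micro_macro_f1(pred, gold))
print(top1_primary_accuracy(pred_reversed, gold), micro_macro_f1(pred_reversed, gold))
```

In this toy setup the two prediction lists get identical F1 scores but different top-1 accuracy, which is the gap the abstract points to: a pure classification metric cannot tell a well-ordered prediction from a badly ordered one.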