🤖 AI Summary
This work addresses the challenge of automating resource allocation and prioritization in asynchronous outpatient patient portal messaging by framing triage as a pairwise comparison task. Large language models judge which of two messages is more medically urgent, and these head-to-head judgments are used to re-sort a physician's inbox. The study introduces PMR-Bench, the first large-scale public benchmark for triaging outpatient portal messages, together with a pairwise ranking formulation, a scalable training data generation pipeline, and an automated in-domain annotation strategy. Using samples that pair unstructured patient messages with real electronic health record (EHR) data, the authors train two model classes: UrgentReward, with a Bradley-Terry preference objective, and UrgentSFT, with supervised fine-tuning (SFT) via next-token prediction. UrgentSFT-8B and UrgentReward-8B improve on off-the-shelf 8B models by 15 and 16 points, respectively, on inbox sorting metrics.
📝 Abstract
Medical triage is the task of allocating medical resources and prioritizing patients based on medical need. This paper introduces the first large-scale public dataset for studying medical triage in the context of asynchronous outpatient portal messages. Our novel task formulation views patient message triage as a pairwise inference problem, where we train LLMs to choose "which message is more medically urgent" in a head-to-head, tournament-style re-sort of a physician's inbox. Our benchmark, PMR-Bench, contains 1,569 unique messages and 2,000+ high-quality test pairs for pairwise medical urgency assessment, alongside a scalable training data generation pipeline. PMR-Bench includes samples that contain both unstructured patient-written messages and real electronic health record (EHR) data, emulating a real-world medical triage scenario. We develop a novel automated data annotation strategy to provide LLMs with in-domain guidance on this task. The resulting data is used to train two model classes, UrgentReward and UrgentSFT, which leverage Bradley-Terry and next-token prediction objectives, respectively, to perform pairwise urgency classification. We find that UrgentSFT achieves top performance on PMR-Bench, with UrgentReward showing distinct advantages in low-resource settings. For example, UrgentSFT-8B and UrgentReward-8B provide a 15- and 16-point boost, respectively, on inbox sorting metrics over off-the-shelf 8B models. Paper resources can be found at https://tinyurl.com/Patient-Message-Triage
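To make the pairwise formulation concrete, the following is a minimal sketch, not the authors' implementation. It assumes a scalar urgency score per message, shows the standard Bradley-Terry probability and loss for a labeled pair, and re-sorts an inbox by counting head-to-head wins. The names `bradley_terry_prob`, `bradley_terry_loss`, `tournament_sort`, and the toy scores are hypothetical; the abstract does not specify how pairwise outcomes are aggregated into a final ordering, so the round-robin win count is an assumption.

```python
import math
from itertools import combinations
from typing import Callable, List


def bradley_terry_prob(score_a: float, score_b: float) -> float:
    """Bradley-Terry probability that item a beats item b: sigmoid(score_a - score_b)."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def bradley_terry_loss(score_more_urgent: float, score_less_urgent: float) -> float:
    """Negative log-likelihood of the labeled preference for one pair."""
    return -math.log(bradley_terry_prob(score_more_urgent, score_less_urgent))


def tournament_sort(
    messages: List[str], more_urgent: Callable[[str, str], bool]
) -> List[str]:
    """Round-robin re-sort (an assumed aggregation): compare every pair once,
    count wins, and order messages by win count, most urgent first."""
    wins = [0] * len(messages)
    for i, j in combinations(range(len(messages)), 2):
        if more_urgent(messages[i], messages[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    # sorted() is stable, so ties keep their original inbox order.
    order = sorted(range(len(messages)), key=lambda idx: -wins[idx])
    return [messages[idx] for idx in order]


if __name__ == "__main__":
    # Hypothetical urgency scores standing in for a trained model's output.
    toy_scores = {
        "Need a refill of my allergy medication": 0.2,
        "Crushing chest pain since this morning": 3.1,
        "Rash is spreading after the new antibiotic": 1.7,
    }
    inbox = list(toy_scores)
    ranked = tournament_sort(inbox, lambda a, b: toy_scores[a] > toy_scores[b])
    for message in ranked:
        print(message)
```

In this sketch a reward-style model would supply the scalar scores directly, while a generative classifier would play the role of the `more_urgent` callable; a full round-robin needs n(n-1)/2 comparisons, so a real inbox might call for a cheaper comparison schedule.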