🤖 AI Summary
This work addresses the problem of efficiently and accurately identifying the relevant codes in standardized medical terminologies when curating clinical value sets. The authors propose Retrieval-Augmented Set Completion (RASC), a two-stage approach that first builds a candidate pool by semantic retrieval from a corpus of existing value sets and then applies a classifier to keep only the relevant codes, which substantially reduces the output space. The study introduces the first large-scale benchmark for automated clinical value set authoring and shows that this retrieve-then-classify approach outperforms direct generation by a zero-shot LLM. On 11,803 VSAC value sets, a fine-tuned SapBERT cross-encoder achieves an AUROC of 0.852 and a value-set-level F1 of 0.298, and reduces the number of irrelevant candidates per true positive from 12.3 (retrieval only) to approximately 3.2. Zero-shot GPT-4o, by comparison, reaches a value-set-level F1 of 0.105.
📝 Abstract
Clinical value set authoring, the task of identifying all codes in a standardized vocabulary that define a clinical concept, is a recurring bottleneck in clinical quality measurement and phenotyping. A natural approach is to prompt a large language model (LLM) to generate the required codes directly, but structured clinical vocabularies are large, version-controlled, and not reliably memorized during pretraining. We propose Retrieval-Augmented Set Completion (RASC): retrieve the $K$ most similar existing value sets from a curated corpus to form a candidate pool, then apply a classifier to each candidate code. Theoretically, retrieve-and-select can reduce statistical complexity by shrinking the effective output space from the full vocabulary to a much smaller retrieved candidate pool. We demonstrate the utility of RASC on 11,803 publicly available VSAC value sets, constructing the first large-scale benchmark for this task. A cross-encoder fine-tuned from SapBERT achieves an AUROC of 0.852 and a value-set-level F1 of 0.298, outperforming a simpler three-layer multilayer perceptron (AUROC 0.799, F1 0.250). Both classifiers reduce the number of irrelevant candidates per true positive from 12.3 (retrieval only) to approximately 3.2 and 4.4, respectively. Zero-shot GPT-4o achieves a value-set-level F1 of 0.105, with 48.6% of returned codes absent from VSAC entirely. The performance gap relative to GPT-4o widens with increasing value set size, consistent with RASC's theoretical advantage. We observe similar performance gains with two other classifier types, namely a cross-encoder initialized from pre-trained SapBERT and a LightGBM model, demonstrating that RASC's benefits extend beyond a single model class. The code to download and create the benchmark dataset, as well as the model training code, is available at: https://github.com/mukhes3/RASC
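The retrieve-then-select idea can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the toy corpus and its code groupings are invented (they are not real VSAC value sets), token-overlap (Jaccard) similarity stands in for semantic retrieval, and a simple vote-share threshold stands in for the trained classifier. All function names are hypothetical.

```python
from collections import Counter

# Toy corpus: invented value sets (title -> codes), for illustration only.
CORPUS = {
    "Type 2 diabetes mellitus": {"E11.9", "E11.65", "E11.21"},
    "Diabetes mellitus with kidney complications": {"E11.21", "E11.22", "N18.3"},
    "Essential hypertension": {"I10"},
    "Chronic kidney disease": {"N18.3", "N18.4"},
}

def tokens(text):
    """Lowercased whitespace tokens of a title."""
    return set(text.lower().split())

def similarity(a, b):
    """Jaccard overlap of title tokens; a stand-in for embedding similarity."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

def retrieve(query, corpus, k):
    """Stage 1: titles of the k value sets most similar to the query."""
    ranked = sorted(corpus, key=lambda title: similarity(query, title), reverse=True)
    return ranked[:k]

def count_candidates(retrieved, corpus):
    """Count how many retrieved value sets contain each candidate code."""
    counts = Counter()
    for title in retrieved:
        counts.update(corpus[title])
    return counts

def select(counts, k, threshold):
    """Stage 2: keep codes whose vote share meets the threshold.
    This is a stand-in for a trained per-code classifier."""
    return {code for code, n in counts.items() if n / k >= threshold}

def irrelevant_per_true_positive(candidates, truth):
    """Number of irrelevant candidates for each correct one."""
    tp = len(candidates & truth)
    return (len(candidates) - tp) / tp if tp else float("inf")

def f1(pred, truth):
    """Set-level F1 between predicted and true code sets."""
    tp = len(pred & truth)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(truth)
    return 2 * precision * recall / (precision + recall)

# Hypothetical target concept and its (invented) ground-truth codes.
QUERY = "type 2 diabetes mellitus with kidney disease"
TRUTH = {"E11.21", "E11.22", "N18.3"}
K = 3

retrieved = retrieve(QUERY, CORPUS, K)
counts = count_candidates(retrieved, CORPUS)
pool = set(counts)                      # retrieval-only candidate pool
selected = select(counts, K, threshold=0.5)

print("pool:    ", sorted(pool), irrelevant_per_true_positive(pool, TRUTH), f1(pool, TRUTH))
print("selected:", sorted(selected), irrelevant_per_true_positive(selected, TRUTH), f1(selected, TRUTH))
```

In this toy setting the selection stage removes the irrelevant candidates from the retrieved pool at the cost of missing one true code, which is the precision and recall trade-off that the irrelevant-candidates-per-true-positive and F1 figures in the abstract summarize.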