Universal Abstraction: Harnessing Frontier Models to Structure Real-World Data at Scale

📅 2025-02-02
🤖 AI Summary
Problem: Extracting clinical information from unstructured medical text is inefficient and relies heavily on labor-intensive manual work such as rule crafting and annotation. Method: We propose UniMedAbstractor (UMA), a zero-shot, scalable, LLM-driven medical abstraction framework, evaluated with GPT-4o, that uses a modular, customizable prompt template to handle both short-context attributes and long-context attributes requiring longitudinal reasoning. Contribution/Results: The framework scales to 15 key oncology attributes without attribute-specific rules or labeled training data. Evaluated on real-world clinical data, it improves F1/accuracy by an average of 2 absolute points over supervised and heuristic baselines; for pathologic T staging, it exceeds the supervised model by 20 points in accuracy.

📝 Abstract
The vast majority of real-world patient information resides in unstructured clinical text, and medical abstraction seeks to extract and normalize structured information from this unstructured input. However, traditional medical abstraction methods can require significant manual effort, such as crafting rules or annotating training labels, which limits scalability. In this paper, we propose UniMedAbstractor (UMA), a zero-shot medical abstraction framework leveraging Large Language Models (LLMs) through a modular and customizable prompt template. We refer to our approach as universal abstraction because it can quickly scale to new attributes through its universal prompt template, without curating attribute-specific training labels or rules. We evaluate UMA for oncology applications, focusing on fifteen key attributes representing the cancer patient journey, from short-context attributes (e.g., performance status, treatment) to complex long-context attributes requiring longitudinal reasoning (e.g., tumor site, histology, TNM staging). Experiments on real-world data show UMA's strong performance and generalizability. Compared to supervised and heuristic baselines, UMA with GPT-4o achieves an average absolute improvement of 2 points in F1/accuracy for both short-context and long-context attribute abstraction. For pathologic T staging, UMA outperforms the supervised model by 20 points in accuracy.
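
To make the idea of a modular, customizable prompt template concrete, here is a minimal Python sketch. It is not the UMA template from the paper: the `AttributeSpec` fields, the section wording, and the `build_prompt` helper are all hypothetical, chosen only to show how one shared template might be filled with attribute-specific modules. No model call is made; the sketch only assembles the prompt text.

```python
from dataclasses import dataclass, field


@dataclass
class AttributeSpec:
    """Hypothetical description of one attribute to abstract (not from the paper)."""
    name: str
    definition: str
    allowed_values: list[str] = field(default_factory=list)
    guidelines: list[str] = field(default_factory=list)


def build_prompt(spec: AttributeSpec, notes: list[str]) -> str:
    """Fill one shared template with attribute-specific modules and patient notes."""
    sections = [
        "You are abstracting structured data from clinical notes.",
        f"Attribute: {spec.name}",
        f"Definition: {spec.definition}",
    ]
    # Optional modules are included only when the attribute defines them.
    if spec.allowed_values:
        sections.append("Allowed values: " + ", ".join(spec.allowed_values))
    if spec.guidelines:
        sections.append("Guidelines:\n" + "\n".join(f"- {g}" for g in spec.guidelines))
    # Notes are numbered in the order given, assumed here to be chronological.
    numbered = "\n".join(f"[Note {i}] {text}" for i, text in enumerate(notes, start=1))
    sections.append("Patient notes (chronological):\n" + numbered)
    sections.append("Answer with a single value, or 'unknown' if it is not stated.")
    return "\n\n".join(sections)


# Illustrative attribute and made-up notes; values are assumptions for the example.
pt_stage = AttributeSpec(
    name="pathologic T stage",
    definition="Extent of the primary tumor as determined from pathology.",
    allowed_values=["T0", "T1", "T2", "T3", "T4", "TX"],
    guidelines=["Prefer pathology reports over clinical impressions."],
)
prompt = build_prompt(pt_stage, ["Biopsy performed.", "Pathology: pT2 tumor."])
print(prompt)
```

Under this design, supporting a new attribute means writing a new `AttributeSpec` rather than new rules or labeled training data, which is the scaling property the abstract describes.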
Problem

Research questions and friction points this paper is trying to address.

Clinical Text Mining
Patient Information Extraction
Real-world Data Analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

UniMedAbstractor
universal abstraction
GPT-4o
Authors

Cliff Wong
Microsoft, Redmond, WA, USA
Sam Preston
Microsoft, Redmond, WA, USA
Qianchu Liu
Microsoft Research
Natural Language Processing
Zelalem Gero
Microsoft, Redmond, WA, USA
Jass Bagga
Microsoft, Redmond, WA, USA
Sheng Zhang
Microsoft, Redmond, WA, USA
Shrey Jain
Microsoft, Redmond, WA, USA
Theodore Zhao
Microsoft, Redmond, WA, USA
Yu Gu
Microsoft, Redmond, WA, USA
Yanbo Xu
Microsoft
Sid Kiblawi
Microsoft
Computational Biology, Machine Learning, NLP
R. Weerasinghe
Providence Genomics, Portland, OR, USA
R. Leidner
Earle A. Chiles Research Institute, Providence Cancer Institute, Portland, OR, USA; Providence Genomics, Portland, OR, USA
Kristina Young
The Oregon Clinic, Radiation Oncology Division, Portland, OR; Earle A. Chiles Research Institute, Providence Cancer Institute, Portland, OR, USA
B. Piening
Providence Genomics, Portland, OR, USA; Earle A. Chiles Research Institute, Providence Cancer Institute, Portland, OR, USA
Carlo Bifulco
Providence Genomics, Portland, OR, USA; Earle A. Chiles Research Institute, Providence Cancer Institute, Portland, OR, USA
Tristan Naumann
Principal Researcher, Microsoft Research Health Futures
Artificial Intelligence, Machine Learning, Natural Language Processing, Clinical Inference
Mu Wei
Microsoft, Redmond, WA, USA
H. Poon
Microsoft, Redmond, WA, USA