🤖 AI Summary
This work addresses the challenge of deploying large language models (LLMs) safely in clinical settings for medical entity extraction, where poor confidence calibration often undermines reliability. To this end, the authors propose a conformal prediction framework that provides finite-sample coverage guarantees for entity extraction across two clinical domains: FDA drug labels and MIMIC-CXR radiology reports. Their analysis reveals that calibration bias is significantly influenced by document structure, entity type, and model architecture, necessitating tailored calibration strategies. By integrating GPT-4.1 and Llama-4-Maverick for extraction and leveraging FactScore atomic statement evaluation alongside physician annotations, the method dynamically sets conformal thresholds to control risk. Experiments demonstrate ≥90% target coverage in both domains with rejection rates of only 9%–13%, confirming the approach's effectiveness and clinical practicality.
📄 Abstract
Large Language Models (LLMs) are increasingly used for medical entity extraction, yet their confidence scores are often miscalibrated, limiting safe deployment in clinical settings. We present a conformal prediction framework that provides finite-sample coverage guarantees for LLM-based extraction across two clinical domains. First, we extract structured entities from 1,000 FDA drug labels across eight sections using GPT-4.1, verified via FactScore-based atomic statement evaluation (97.7\% accuracy over 128,906 entities). Second, we extract radiological entities from MIMIC-CXR reports using the RadGraph schema with GPT-4.1 and Llama-4-Maverick, evaluated against physician annotations (entity F1: 0.81 to 0.84). Our central finding is that miscalibration direction reverses across domains: on well-structured FDA labels, models are underconfident, requiring modest conformal thresholds ($\tau \approx 0.06$), while on free-text radiology reports, models are overconfident, demanding strict thresholds ($\tau$ up to 0.99). Despite this heterogeneity, conformal prediction achieves target coverage ($\geq 90\%$) in both settings with manageable rejection rates (9--13\%). These results demonstrate that calibration is not a global model property but depends on document structure, extraction category, and model architecture, motivating domain-specific conformal calibration for safe clinical deployment.
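The abstract's core mechanism is split conformal calibration: pick the smallest confidence threshold $\tau$ on a held-out calibration set so that accepted extractions achieve the target coverage with a finite-sample guarantee, and reject anything below it. As a minimal sketch (not the paper's implementation; the calibration data, score definition, and function names here are illustrative assumptions), this could look like:

```python
import numpy as np

def conformal_threshold(cal_confidences, alpha=0.10):
    """Split-conformal confidence threshold.

    cal_confidences: confidences the model assigned to calibration
    entities verified as correct (e.g. via FactScore checks or
    physician annotation). Accepting only test entities whose
    confidence meets the returned threshold targets >= (1 - alpha)
    coverage, up to the standard finite-sample quantile correction.
    """
    scores = 1.0 - np.asarray(cal_confidences)  # nonconformity scores
    n = len(scores)
    # Finite-sample-corrected quantile level: ceil((n+1)(1-alpha)) / n.
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    score_cutoff = np.quantile(scores, q, method="higher")
    return 1.0 - score_cutoff  # accept if confidence >= this value

# Hypothetical calibration confidences, skewed toward high values.
rng = np.random.default_rng(0)
cal = rng.beta(8, 2, size=500)
tau = conformal_threshold(cal, alpha=0.10)
accept_rate = np.mean(cal >= tau)  # roughly 1 - alpha on calibration data
```

Under this framing, the domain difference reported above is just a difference in the learned cutoff: well-calibrated-but-underconfident FDA-label scores yield a low $\tau$, while overconfident radiology scores force $\tau$ near 1, and the rejection rate is whatever mass of test confidences falls below it.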