Key Coverage Matters: Semi-Structured Extraction of OCR Clinical Reports

📅 2026-05-10

🤖 AI Summary

This study addresses the challenge of extracting structured information from heterogeneous paper or scanned clinical reports across institutions, a task hindered by format variability, OCR noise, and privacy constraints. The authors formulate extraction as canonical key-conditioned extractive question answering over an open key space. A canonical key inventory is maintained through iterative key mining, normalization, clustering, and lightweight human verification, and “key coverage” is introduced as a metric that quantifies inventory completeness. With a 0.2B BERT-based extractor, the method reaches F1 scores of 0.839 under exact match and 0.893 under boundary-tolerant matching once the Top-90 canonical keys are covered, and it outperforms a fine-tuned Qwen3-0.6B baseline under exact match. Performance improves monotonically with key coverage, indicating that coverage is a dominant factor in end-to-end performance. Although the annotated corpus is Chinese, the authors argue that the approach can be adapted to other settings given an appropriate canonical key inventory and alias mapping.
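The two matching criteria mentioned above can be illustrated with a small sketch. The summary does not define boundary-tolerant matching, so the rule used here (a predicted character span counts as correct when each boundary is within `tol` characters of the gold boundary) is an assumption, as are the function names and the toy spans.

```python
def exact_match(pred, gold):
    """A predicted (start, end) span is correct only if it equals the gold span."""
    return pred == gold


def boundary_tolerant_match(pred, gold, tol=1):
    """Assumed rule: each boundary may be off by at most `tol` characters."""
    return abs(pred[0] - gold[0]) <= tol and abs(pred[1] - gold[1]) <= tol


def span_f1(preds, golds, match):
    """F1 over key -> span dictionaries under the given matching criterion."""
    true_pos = sum(
        1 for key, span in preds.items()
        if key in golds and match(span, golds[key])
    )
    precision = true_pos / len(preds) if preds else 0.0
    recall = true_pos / len(golds) if golds else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): the diagnosis span overshoots by one character,
# and date_of_birth is not predicted at all.
golds = {"patient_name": (10, 18), "diagnosis": (40, 62), "date_of_birth": (70, 80)}
preds = {"patient_name": (10, 18), "diagnosis": (40, 63)}

print("exact match F1:", span_f1(preds, golds, exact_match))
print("boundary-tolerant F1:", span_f1(preds, golds, boundary_tolerant_match))
```

A tolerant criterion of this kind forgives small span offsets, which is one plausible reason the boundary-tolerant score is higher than the exact-match score on noisy OCR text.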

📝 Abstract

Clinical reports are often fragmented across healthcare institutions because privacy regulations and data silos limit direct information sharing. When patients seek care at a different hospital, they often carry paper or scanned reports from prior visits. This hinders EHR integration, longitudinal review, and downstream applications that depend on more complete patient records, such as patient management, follow-up care, real-world studies, and clinical-trial matching. Although OCR can digitize such reports, reliable extraction remains challenging because clinical documents are heterogeneous, OCR text is noisy, and many healthcare settings require low-cost on-premise deployment. We formulate this problem as canonical key-conditioned extractive question answering over OCR-derived clinical reports. Because the key fields are neither fixed nor known in advance, the key space is open. We maintain a canonical key inventory through iterative key mining, normalization, clustering, and lightweight human verification, and introduce key coverage as a metric to quantify inventory completeness. Using a 0.2B BERT-based model, experiments on real-world reports from more than 20 hospitals show that performance improves monotonically with key coverage. The model achieves F1 scores of 0.839 and 0.893 under exact match and boundary-tolerant matching, respectively, once the Top-90 canonical keys are covered. These results show that key coverage is a dominant factor for end-to-end performance. At Top-90 coverage, our model outperforms a fine-tuned Qwen3-0.6B baseline under exact match. Although our annotated corpus is Chinese, the method relies on the language-agnostic key-value organization of semi-structured clinical reports and can be adapted to other settings given an appropriate canonical key inventory and alias mapping.
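To make the key-coverage idea concrete, here is a minimal sketch that maps raw OCR keys to canonical keys through an alias table and then measures how much of the observed key mass a Top-K inventory covers. The abstract does not give the exact formula, so the frequency-weighted definition, the alias table, and the sample keys below are all assumptions for illustration.

```python
from collections import Counter

# Hypothetical alias map: cleaned raw key -> canonical key.
ALIASES = {
    "pt name": "patient_name",
    "name": "patient_name",
    "dob": "date_of_birth",
    "birth date": "date_of_birth",
    "dx": "diagnosis",
}


def canonicalize(raw_key):
    """Lowercase, drop colons, collapse whitespace, then apply the alias map."""
    cleaned = " ".join(raw_key.lower().replace(":", " ").split())
    return ALIASES.get(cleaned, cleaned)


def key_coverage(raw_keys, top_k):
    """Assumed definition: share of key occurrences whose canonical key
    is among the top_k most frequent canonical keys."""
    counts = Counter(canonicalize(k) for k in raw_keys)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for _, n in counts.most_common(top_k))
    return covered / total


# Toy keys as they might appear after OCR (hypothetical).
observed = ["Pt Name:", "name", "DOB", "Dx", "Birth Date", "Specimen", "dx:"]

for k in (1, 2, 3, 4):
    print(f"Top-{k} coverage: {key_coverage(observed, k):.3f}")
```

In this toy setting coverage rises as more canonical keys are added to the inventory, which mirrors the kind of Top-K sweep the abstract reports on.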

Problem

Research questions and friction points this paper is trying to address.

clinical report extraction

OCR

key coverage

data fragmentation

EHR integration

Innovation

Methods, ideas, or system contributions that make the work stand out.

key coverage

semi-structured extraction

OCR clinical reports