MedStruct-S: A Benchmark for Key Discovery, Key-Conditioned QA and Semi-Structured Extraction from OCR Clinical Reports

📅 2026-05-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing evaluations rarely test how robust models are to unseen field keys and OCR noise in real-world clinical reports, especially across three tasks: key discovery, key-conditioned question answering, and end-to-end key-value extraction. This work proposes MedStruct-S, a benchmark of 3,582 annotated pages from real-world OCR clinical reports, designed to evaluate these tasks under unknown keys and OCR artifacts. Evaluations of nine models (four encoder-only and five decoder-only, spanning 0.11B to 103B parameters) show that encoder-only models perform best on non-null key-conditioned question answering and also perform better overall when compared with decoder-only models of a similar order of magnitude, while fine-tuned decoder-only models achieve the strongest overall results when model scale is not controlled.

📝 Abstract

Semi-structured information extraction (IE) from OCR-derived clinical reports is crucial for efficiently reconstructing patients' longitudinal medical histories. In practice, this scenario commonly involves three tasks: (i) field-header (key) discovery, (ii) key-conditioned question answering (QA), and (iii) end-to-end key-value pair extraction. However, existing evaluations often under-model two factors: heterogeneous and incompletely known key representations, and OCR-induced noise. This makes it difficult to assess model robustness in real-world settings. We present MedStruct-S, a benchmark specifically designed to evaluate these tasks under unknown keys and OCR noise. MedStruct-S contains 3,582 annotated real-world clinical report pages. Using MedStruct-S, we benchmark two representative paradigms: encoder-only sequence labeling with post-processing and decoder-only structured generation, covering four encoder-only and five decoder-only models spanning 0.11B to 103B parameters. Our results show that encoder-only models achieve the best performance for non-null-value key-conditioned QA despite being substantially smaller than decoder-only models. When comparing models of similar order of magnitude, encoder-only models still perform better overall. Without controlling for model scale, fine-tuned decoder-only models deliver the strongest overall results. These findings show that the benchmark provides a reliable and practical basis for selecting and comparing models across different semi-structured IE settings.
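To make the three tasks concrete, the following is a minimal sketch of how a sequence-labeling output could be post-processed into answers for each of them. It is an illustration only, not the authors' pipeline: the tag set (`B-KEY`, `I-KEY`, `B-VALUE`, `I-VALUE`, `O`), the rule that a value attaches to the most recent key, and every function name are assumptions introduced here, and the sample tokens are invented.

```python
from typing import Dict, List, Optional, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (label, text) spans.

    Assumed tag set (hypothetical): B-KEY, I-KEY, B-VALUE, I-VALUE, O.
    """
    spans: List[Tuple[str, str]] = []
    label: Optional[str] = None
    buffer: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new span starts; close the one that was open, if any.
            if label is not None:
                spans.append((label, " ".join(buffer)))
            label, buffer = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            buffer.append(token)
        else:
            # "O" or an inconsistent I- tag closes the open span;
            # the current token is not kept.
            if label is not None:
                spans.append((label, " ".join(buffer)))
            label, buffer = None, []
    if label is not None:
        spans.append((label, " ".join(buffer)))
    return spans


def pair_keys_with_values(spans: List[Tuple[str, str]]) -> Dict[str, Optional[str]]:
    """Attach each VALUE span to the most recent KEY span (assumed rule).

    Keys that never receive a value map to None (a null value).
    """
    pairs: Dict[str, Optional[str]] = {}
    last_key: Optional[str] = None
    for label, text in spans:
        if label == "KEY":
            last_key = text
            pairs.setdefault(text, None)
        elif label == "VALUE" and last_key is not None and pairs[last_key] is None:
            pairs[last_key] = text
    return pairs


def discover_keys(pairs: Dict[str, Optional[str]]) -> List[str]:
    """Task (i): return the field headers found on the page."""
    return list(pairs)


def answer_for_key(pairs: Dict[str, Optional[str]], key: str) -> Optional[str]:
    """Task (ii): return the value for one given key, or None if it has none."""
    return pairs.get(key)


# Invented example page; "Pat1ent" mimics an OCR character substitution.
tokens = ["Pat1ent", "Name", ":", "Jane", "Doe", "Allergies", ":", "WBC", "6.2"]
tags = ["B-KEY", "I-KEY", "O", "B-VALUE", "I-VALUE", "B-KEY", "O", "B-KEY", "B-VALUE"]

pairs = pair_keys_with_values(decode_bio(tokens, tags))  # task (iii)
keys = discover_keys(pairs)                              # task (i)
answer = answer_for_key(pairs, "Allergies")              # task (ii), null value
```

In this sketch the noisy key "Pat1ent Name" is kept exactly as the OCR produced it, and "Allergies" maps to `None`. That is one way to picture both the unknown-key setting and the distinction between null and non-null values in key-conditioned QA.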

Problem

Research questions and friction points this paper is trying to address.

semi-structured information extraction

OCR noise

key discovery

clinical reports

robustness evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

semi-structured information extraction

OCR clinical reports

key discovery