MedRepBench: A Comprehensive Benchmark for Medical Report Interpretation

📅 2025-08-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
The absence of cross-departmental, multi-format standardized benchmarks hinders objective evaluation of structured medical report interpretation. Method: We introduce MedRepBench—the first comprehensive benchmark comprising 1,900 real-world Chinese clinical reports—and propose a dual-protocol evaluation framework: (i) objective assessment via field-level recall, and (ii) automated subjective scoring (factuality, explainability, reasoning quality) powered by large language models (LLMs). We further design contrastive experiments leveraging high-fidelity OCR text and propose an objective reward function for structured extraction, optimized via Group Relative Policy Optimization (GRPO) to enhance mid-scale vision-language models. Contribution/Results: Our analysis reveals inherent limitations of OCR+LLM pipelines in layout awareness and latency. GRPO yields up to 6% absolute recall gain; ablation confirms the robustness and efficiency advantages of pure-vision pathways, advancing end-to-end clinical report understanding toward real-world deployment.

📝 Abstract
Medical report interpretation plays a crucial role in healthcare, enabling both patient-facing explanations and effective information flow across clinical systems. While recent vision-language models (VLMs) and large language models (LLMs) have demonstrated general document understanding capabilities, there remains a lack of standardized benchmarks to assess structured interpretation quality in medical reports. We introduce MedRepBench, a comprehensive benchmark built from 1,900 de-identified real-world Chinese medical reports spanning diverse departments, patient demographics, and acquisition formats. The benchmark is designed primarily to evaluate end-to-end VLMs for structured medical report understanding. To enable controlled comparisons, we also include a text-only evaluation setting using high-quality OCR outputs combined with LLMs, allowing us to estimate the upper-bound performance when character recognition errors are minimized. Our evaluation framework supports two complementary protocols: (1) an objective evaluation measuring field-level recall of structured clinical items, and (2) an automated subjective evaluation using a powerful LLM as a scoring agent to assess factuality, interpretability, and reasoning quality. Based on the objective metric, we further design a reward function and apply Group Relative Policy Optimization (GRPO) to improve a mid-scale VLM, achieving up to 6% recall gain. We also observe that the OCR+LLM pipeline, despite strong performance, suffers from layout-blindness and latency issues, motivating further progress toward robust, fully vision-based report understanding.
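To make the objective protocol concrete, here is a minimal sketch of field-level recall over structured clinical items, the kind of scalar that could also serve as a reward signal for GRPO-style optimization. The field names, matching rule, and function signature are illustrative assumptions, not MedRepBench's actual schema or reward definition.

```python
def field_level_recall(predicted: dict, reference: dict) -> float:
    """Fraction of reference fields whose values the prediction recovers.

    Uses exact string match per field; a real benchmark would likely
    normalize units, whitespace, and numeric formats before comparing.
    """
    if not reference:
        return 0.0
    hits = sum(
        1 for field, value in reference.items()
        if predicted.get(field) == value
    )
    return hits / len(reference)


# Example: one lab-report item with three reference fields.
reference = {"item": "Hemoglobin", "value": "132", "unit": "g/L"}
predicted = {"item": "Hemoglobin", "value": "132", "unit": "g/dL"}
reward = field_level_recall(predicted, reference)  # 2 of 3 fields match
```

Because the score is a bounded scalar per report, it slots directly into a group-relative advantage computation, which is presumably how the paper's reward function drives the reported recall gains.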
Problem

Research questions and friction points this paper is trying to address.

Lack of standardized benchmarks for medical report structured interpretation
Evaluating vision-language models on real-world medical report understanding
Assessing structured clinical item recall and reasoning quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive benchmark of 1,900 real-world Chinese medical reports
Dual evaluation: objective field-level recall and automated subjective LLM scoring
GRPO optimization achieving up to 6% absolute recall gain in a mid-scale VLM
Fangxin Shang
Baidu Inc., China
Yuan Xia
Baidu Inc., China
Dalu Yang
Baidu Inc., China
Computer Vision, Medical Image Analysis
Yahui Wang
Baidu Inc., China
Binglin Yang
Baidu Inc., China