MeDocVL: A Visual Language Model for Medical Document Understanding and Parsing

📅 2026-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Medical document OCR remains challenging due to complex layouts, domain-specific terminology, and noisy annotations, so both conventional OCR systems and general-purpose vision-language models struggle to reach field-level parsing accuracy. To address this, the work proposes MeDocVL, a query-driven vision-language model tailored to medical documents. MeDocVL couples a training-driven label refinement mechanism with a noise-aware hybrid post-training strategy that integrates reinforcement learning and supervised fine-tuning, which improves robustness and parsing precision under noisy supervision. Evaluated on medical invoice benchmarks, MeDocVL outperforms existing OCR systems and strong vision-language baselines, achieving state-of-the-art performance in the presence of annotation noise.

📝 Abstract
Medical document OCR is challenging due to complex layouts, domain-specific terminology, and noisy annotations, while requiring strict field-level exact matching. Existing OCR systems and general-purpose vision-language models often fail to reliably parse such documents. We propose MeDocVL, a post-trained vision-language model for query-driven medical document parsing. Our framework combines Training-driven Label Refinement to construct high-quality supervision from noisy annotations, with a Noise-aware Hybrid Post-training strategy that integrates reinforcement learning and supervised fine-tuning to achieve robust and precise extraction. Experiments on medical invoice benchmarks show that MeDocVL consistently outperforms conventional OCR systems and strong VLM baselines, achieving state-of-the-art performance under noisy supervision.
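
The abstract's "strict field-level exact matching" requirement can be illustrated with a minimal scoring sketch. This is not the authors' evaluation code: the function name `field_exact_match`, the field names, and the sample values are hypothetical, and the only normalization assumed here is stripping surrounding whitespace.

```python
from typing import Dict


def field_exact_match(pred: Dict[str, str], gold: Dict[str, str]) -> float:
    """Return the fraction of gold fields whose predicted value matches exactly.

    Assumption: a field counts as correct only if the strings are identical
    after stripping leading/trailing whitespace. Missing fields count as wrong.
    """
    if not gold:
        return 0.0
    hits = sum(
        1
        for key, value in gold.items()
        if pred.get(key, "").strip() == value.strip()
    )
    return hits / len(gold)


# Hypothetical invoice fields, for illustration only.
gold = {"invoice_no": "A-10492", "total": "128.50", "patient": "Li Hua"}
pred = {"invoice_no": "A-10492", "total": "128.5", "patient": "Li Hua"}

score = field_exact_match(pred, gold)
print(f"field-level exact match: {score:.3f}")
```

Under this strict criterion, "128.5" does not match "128.50", so a single dropped character makes the whole field count as an error. This is why small OCR slips matter more here than they would under a character-level metric.
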
Problem

Research questions and friction points this paper is trying to address.

Medical Document OCR
Complex Layouts
Noisy Annotations
Field-level Exact Matching
Domain-specific Terminology
Innovation

Methods, ideas, or system contributions that make the work stand out.

Medical Document Parsing
Vision-Language Model
Noisy Annotation
Label Refinement
Hybrid Post-training
🔎 Similar Papers
Wenjie Wang
Wei Wu
Ying Liu
Yuan Zhao
Lanzhou University of Technology
time series forecasting
Xiaole Lv
Liang Diao
Ping An Property & Casualty Insurance Company
Zengjian Fan
Wenfeng Xie
Ziling Lin
De Shi
Lin Huang
Stanford University, The Chinese University of Hong Kong
computational genomics, fault-tolerant computing, design automation, and multi-core architecture
Kaihe Xu
Hong Li