MeDocVL: A Visual Language Model for Medical Document Understanding and Parsing

📅 2026-02-06
📈 Citations: 0
Influential: 0
📄 PDF

career value

173K/year
🤖 AI Summary
Medical document OCR remains challenging due to complex layouts, domain-specific terminology, and noisy annotations, leading to limited performance from both conventional OCR systems and general-purpose vision-language models in achieving field-level parsing accuracy. To address this, this work proposes MeDocVL, a query-driven vision-language model tailored for medical documents. MeDocVL introduces a novel training-driven label refinement mechanism coupled with a noise-aware hybrid post-training strategy that integrates reinforcement learning and supervised fine-tuning. This approach significantly enhances robustness and parsing precision under noisy supervision. Evaluated on a medical invoice benchmark, MeDocVL substantially outperforms existing OCR systems and strong vision-language baselines, establishing state-of-the-art performance in the presence of annotation noise.

Technology Category

Application Category

📝 Abstract
Medical document OCR is challenging due to complex layouts, domain-specific terminology, and noisy annotations, while requiring strict field-level exact matching. Existing OCR systems and general-purpose vision-language models often fail to reliably parse such documents. We propose MeDocVL, a post-trained vision-language model for query-driven medical document parsing. Our framework combines Training-driven Label Refinement to construct high-quality supervision from noisy annotations, with a Noise-aware Hybrid Post-training strategy that integrates reinforcement learning and supervised fine-tuning to achieve robust and precise extraction. Experiments on medical invoice benchmarks show that MeDocVL consistently outperforms conventional OCR systems and strong VLM baselines, achieving state-of-the-art performance under noisy supervision.
Problem

Research questions and friction points this paper is trying to address.

Medical Document OCR
Complex Layouts
Noisy Annotations
Field-level Exact Matching
Domain-specific Terminology
Innovation

Methods, ideas, or system contributions that make the work stand out.

Medical Document Parsing
Vision-Language Model
Noisy Annotation
Label Refinement
Hybrid Post-training
🔎 Similar Papers