MeDocVL: A Visual Language Model for Medical Document Understanding and Parsing

📅 2026-02-06

🤖 AI Summary

Medical document OCR remains challenging because of complex layouts, domain-specific terminology, and noisy annotations. As a result, both conventional OCR systems and general-purpose vision-language models struggle to reach field-level parsing accuracy. This work proposes MeDocVL, a query-driven vision-language model tailored to medical documents. MeDocVL pairs a training-driven label refinement mechanism with a noise-aware hybrid post-training strategy that integrates reinforcement learning and supervised fine-tuning, with the goal of improving robustness and parsing precision under noisy supervision. On medical invoice benchmarks, MeDocVL outperforms existing OCR systems and strong vision-language baselines, and the authors report state-of-the-art performance in the presence of annotation noise.

📝 Abstract

Medical document OCR is challenging due to complex layouts, domain-specific terminology, and noisy annotations, while requiring strict field-level exact matching. Existing OCR systems and general-purpose vision-language models often fail to reliably parse such documents. We propose MeDocVL, a post-trained vision-language model for query-driven medical document parsing. Our framework combines Training-driven Label Refinement to construct high-quality supervision from noisy annotations, with a Noise-aware Hybrid Post-training strategy that integrates reinforcement learning and supervised fine-tuning to achieve robust and precise extraction. Experiments on medical invoice benchmarks show that MeDocVL consistently outperforms conventional OCR systems and strong VLM baselines, achieving state-of-the-art performance under noisy supervision.
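
To make "strict field-level exact matching" concrete, here is a minimal sketch of how such a score could be computed. It is an illustration only: the abstract does not specify the paper's metric, so the function name `field_exact_match`, the field names, and the choice to trim only surrounding whitespace are all assumptions.

```python
from typing import Dict


def field_exact_match(pred: Dict[str, str], gold: Dict[str, str]) -> float:
    """Return the fraction of gold fields whose predicted value matches exactly.

    Assumption (not from the paper): values are compared as strings after
    trimming surrounding whitespace only. No numeric or case normalization
    is applied, so "1280.5" does not match "1280.50".
    """
    if not gold:
        return 0.0
    hits = 0
    for field, gold_value in gold.items():
        pred_value = pred.get(field)
        # A missing field counts as a miss, the same as a wrong value.
        if pred_value is not None and pred_value.strip() == gold_value.strip():
            hits += 1
    return hits / len(gold)


# Hypothetical invoice fields, invented for illustration.
gold = {"patient_name": "Li Wei", "total_amount": "1280.50", "invoice_no": "A-00123"}
pred = {"patient_name": "Li Wei", "total_amount": "1280.5", "invoice_no": "A-00123"}

score = field_exact_match(pred, gold)
print(f"field-level exact match: {score:.3f}")
```

In this example two of the three fields match. The amount field fails even though it is numerically equal, which shows why exact matching is demanding for OCR output: a single dropped character turns a nearly correct field into an error.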

Problem

Research questions and friction points this paper is trying to address.

Medical Document OCR

Complex Layouts

Noisy Annotations

Field-level Exact Matching

Domain-specific Terminology

Innovation

Methods, ideas, or system contributions that make the work stand out.

Medical Document Parsing

Vision-Language Model

Noisy Annotation