🤖 AI Summary
Medical document OCR remains challenging because of complex layouts, domain-specific terminology, and noisy annotations, so both conventional OCR systems and general-purpose vision-language models struggle to reach the strict field-level exact-match accuracy the task requires. This work proposes MeDocVL, a post-trained vision-language model for query-driven medical document parsing. MeDocVL combines Training-driven Label Refinement, which constructs high-quality supervision from noisy annotations, with a Noise-aware Hybrid Post-training strategy that integrates reinforcement learning and supervised fine-tuning for robust and precise extraction. Evaluated on medical invoice benchmarks, MeDocVL consistently outperforms conventional OCR systems and strong vision-language baselines, achieving state-of-the-art performance under noisy supervision.
📝 Abstract
Medical document OCR is challenging due to complex layouts, domain-specific terminology, and noisy annotations, while requiring strict field-level exact matching. Existing OCR systems and general-purpose vision-language models often fail to reliably parse such documents. We propose MeDocVL, a post-trained vision-language model for query-driven medical document parsing. Our framework combines Training-driven Label Refinement to construct high-quality supervision from noisy annotations, with a Noise-aware Hybrid Post-training strategy that integrates reinforcement learning and supervised fine-tuning to achieve robust and precise extraction. Experiments on medical invoice benchmarks show that MeDocVL consistently outperforms conventional OCR systems and strong VLM baselines, achieving state-of-the-art performance under noisy supervision.
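To make the "strict field-level exact matching" requirement concrete, the following is a minimal sketch of how such a score could be computed. It is not the paper's evaluation code: the function name `field_exact_match`, the field names, and the sample values are all hypothetical, and the paper may normalize or aggregate fields differently.

```python
from typing import Dict


def field_exact_match(pred: Dict[str, str], gold: Dict[str, str]) -> float:
    """Return the fraction of gold fields whose predicted value matches exactly.

    Assumption: a field counts as correct only if the predicted string is
    identical to the annotated string, with no normalization applied.
    """
    if not gold:
        return 0.0
    hits = sum(1 for key, value in gold.items() if pred.get(key) == value)
    return hits / len(gold)


# Hypothetical invoice fields, for illustration only.
gold = {
    "patient_name": "Li Wei",
    "total_amount": "1280.50",
    "invoice_date": "2024-03-15",
}
pred = {
    "patient_name": "Li Wei",
    "total_amount": "1280.5",  # numerically equal, but not an exact string match
    "invoice_date": "2024-03-15",
}

score = field_exact_match(pred, gold)
```

Under this strict criterion the `total_amount` prediction counts as an error even though it is numerically equal, which illustrates why both noisy annotations and small recognition slips are costly in this setting.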