DocVAL: Validated Chain-of-Thought Distillation for Grounded Document VQA

📅 2025-11-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large models exhibit strong localization capability in Document Visual Question Answering (DocVQA) but incur prohibitive deployment costs, whereas lightweight student models suffer severe deficits in spatial reasoning. To bridge this gap, the paper proposes DocVAL, a validated chain-of-thought distillation framework that employs a multi-module validator (VAL) to provide pixel-level localization feedback. DocVAL further integrates validation-time text detection for noise suppression, two-stage training, and iterative refinement on high-quality chain-of-thought (CoT) traces to enhance geometric consistency and answer accuracy in student models. Requiring no text detection or OCR at inference, the Gemma-3 12B student achieves 91.4% ANLS and 82.4% mAP, substantially outperforming baseline methods. The authors also publicly release a dataset of 95,000 validator-verified CoT traces to advance lightweight DocVQA research.

📝 Abstract
Document visual question answering (DocVQA) requires models to jointly reason over textual content and spatial layout, yet current systems exhibit a sharp accuracy–efficiency trade-off: large teacher models achieve strong grounding but are too expensive for deployment, while compact students suffer substantial drops in localization performance. We propose DocVAL, a validated chain-of-thought distillation framework that transfers the spatial reasoning ability of a large teacher into a deployable student VLM through three key components: (1) teacher supervision with validation-time text detection to filter and denoise training signals, (2) a multi-module validator (VAL) that enforces answer correctness and geometric consistency while producing fine-grained, pixel-level error feedback, and (3) a two-stage student training scheme that first learns from validated CoT traces and then undergoes iterative refinement driven by VAL feedback. Our student (Gemma-3 12B) achieves 91.4% ANLS and 82.4% mAP on DocVQA as a pure VLM requiring no text detection or OCR at inference. Extensive ablations demonstrate that validated feedback contributes a 6.3 mAP gain and iterative refinement a further 9.7 mAP. We release 95k high-quality, validator-verified CoT traces to advance spatial reasoning research in document understanding.
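The headline 91.4% ANLS refers to the standard DocVQA answer metric, Average Normalized Levenshtein Similarity. A minimal sketch of how it is typically computed (this is an illustration of the public metric, not the authors' evaluation code):

```python
# ANLS: per question, take the best normalized Levenshtein similarity
# against any gold answer; scores below the 0.5 threshold count as 0.

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def nls(pred: str, gold: str) -> float:
    pred, gold = pred.strip().lower(), gold.strip().lower()
    if not pred and not gold:
        return 1.0
    sim = 1.0 - levenshtein(pred, gold) / max(len(pred), len(gold))
    return sim if sim >= 0.5 else 0.0  # threshold tau = 0.5

def anls(preds, golds_per_q):
    # golds_per_q: one list of acceptable gold answers per question.
    return sum(max(nls(p, g) for g in gs)
               for p, gs in zip(preds, golds_per_q)) / len(preds)
```

The 0.5 threshold makes the metric forgiving of minor OCR-style typos while zeroing out answers that differ substantially from every gold string.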
Problem

Research questions and friction points this paper is trying to address.

Transfer spatial reasoning from large teacher to compact student model
Improve localization accuracy in document visual question answering
Enable efficient deployment without text detection or OCR
Innovation

Methods, ideas, or system contributions that make the work stand out.

Validated chain-of-thought distillation transfers teacher spatial reasoning
Multi-module validator ensures answer correctness and geometric consistency
Two-stage student training uses validated CoT traces and iterative refinement
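The validator's role described above (checking answer correctness and geometric consistency, then emitting fine-grained feedback) can be sketched as a simple gate over candidate CoT traces. The function names, box format, and IoU threshold below are assumptions for illustration, not the paper's VAL implementation:

```python
# Hypothetical VAL-style gate: keep a CoT trace only if the answer
# matches and the predicted region overlaps the reference region;
# otherwise return pixel-level feedback for iterative refinement.

def iou(box_a, box_b):
    # Boxes as (x1, y1, x2, y2) in pixel coordinates.
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter = (max(0, min(ax2, bx2) - max(ax1, bx1))
             * max(0, min(ay2, by2) - max(ay1, by1)))
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    return inter / union if union else 0.0

def validate_trace(pred_answer, gold_answer, pred_box, gold_box,
                   iou_thresh=0.5):
    # Returns (keep, feedback). An empty feedback list means the trace
    # passed both the answer check and the geometric-consistency check.
    answer_ok = pred_answer.strip().lower() == gold_answer.strip().lower()
    box_iou = iou(pred_box, gold_box)
    feedback = []
    if not answer_ok:
        feedback.append(f"answer mismatch: got {pred_answer!r}, "
                        f"expected {gold_answer!r}")
    if box_iou < iou_thresh:
        feedback.append(f"localization off: IoU={box_iou:.2f} < {iou_thresh}")
    return (answer_ok and box_iou >= iou_thresh), feedback
```

In the two-stage scheme, traces passing such a gate would form the initial supervised set, while the feedback strings drive the refinement rounds on rejected traces.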