🤖 AI Summary
This work addresses the poor adaptability and hallucination issues of multimodal models in regulated domains, where scanned, visually rich documents lack human annotations and domain knowledge is hard to keep up to date. To overcome these challenges, we propose a retrieval-guided reasoning framework based on synthetic supervision. Our approach introduces an annotation-free mechanism that uses an agent system to automatically generate and verify high-quality question-answer pairs, which are then used to train a lightweight visual retriever. This retriever works with a multimodal large language model through an iterative retrieve-and-generate loop. Evaluated on multiple visually rich document understanding (VRDU) benchmarks, the method substantially improves grounding and domain generalization while mitigating hallucinations, and it supports plug-and-play deployment without manual labeling.
📝 Abstract
Visually rich document understanding (VRDU) in regulated domains is particularly challenging, since scanned documents often contain sensitive, evolving, and domain-specific knowledge. This leads to two major challenges: the lack of manual annotations for model adaptation and the difficulty for pretrained models to stay up to date with domain-specific facts. While Multimodal Large Language Models (MLLMs) show strong zero-shot abilities, they still suffer from hallucination and limited domain grounding. In contrast, discriminative Vision-Language Pre-trained Models (VLPMs) provide reliable grounding but require costly annotations to cover new domains. We introduce Docs2Synth, a synthetic-supervision framework that enables retrieval-guided inference for private and low-resource domains. Docs2Synth automatically processes raw document collections, generates and verifies diverse QA pairs via an agent-based system, and trains a lightweight visual retriever to extract domain-relevant evidence. During inference, the retriever collaborates with an MLLM through an iterative retrieval-generation loop, reducing hallucination and improving response consistency. We further deliver Docs2Synth as an easy-to-use Python package, enabling plug-and-play deployment across diverse real-world scenarios. Experiments on multiple VRDU benchmarks show that Docs2Synth substantially enhances grounding and domain generalization without requiring human annotations.
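To make the iterative retrieval-generation loop concrete, the following is a minimal sketch of one way such a loop could be structured. It is not the Docs2Synth API: `Region`, `toy_retriever`, `toy_generator`, and `answer_with_retrieval` are hypothetical names introduced here for illustration, the retriever is replaced by simple word overlap, and the MLLM is replaced by a rule that either answers or asks for more evidence.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Region:
    """Hypothetical document region, e.g. one OCR'd text block on a page."""
    page: int
    text: str


def _words(text: str) -> set:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"\w+", text.lower()))


def toy_retriever(query: str, regions: List[Region], k: int = 1) -> List[Region]:
    """Stand-in for a trained retriever: rank regions by word overlap with the query."""
    q = _words(query)
    ranked = sorted(regions, key=lambda r: len(q & _words(r.text)), reverse=True)
    return ranked[:k]


def toy_generator(question: str, evidence: List[Region]) -> Tuple[str, Optional[str]]:
    """Stand-in for an MLLM: answer if the evidence suffices, else emit a follow-up query."""
    for region in evidence:
        if "amount due:" in region.text.lower():
            return region.text.split(":", 1)[1].strip(), None
    # Evidence is insufficient, so request another retrieval round.
    return "", "total amount due"


def answer_with_retrieval(
    question: str,
    regions: List[Region],
    retrieve: Callable[[str, List[Region]], List[Region]],
    generate: Callable[[str, List[Region]], Tuple[str, Optional[str]]],
    max_rounds: int = 3,
) -> Tuple[str, List[Region]]:
    """Alternate retrieval and generation until the generator stops asking for evidence."""
    query, evidence, answer = question, [], ""
    for _ in range(max_rounds):
        for region in retrieve(query, regions):
            if region not in evidence:
                evidence.append(region)
        answer, follow_up = generate(question, evidence)
        if follow_up is None:
            break
        query = follow_up
    return answer, evidence


regions = [
    Region(1, "Invoice number: 123"),
    Region(2, "Total amount due: 42 USD"),
    Region(1, "Customer address: Sydney"),
]
answer, evidence = answer_with_retrieval(
    "How much is owed on this invoice?", regions, toy_retriever, toy_generator
)
print(answer)  # 42 USD
```

In this toy run the first retrieval returns the invoice-number region, which does not answer the question, so the generator issues a refined query and the second round retrieves the region containing the amount. Returning the accumulated evidence alongside the answer keeps the response traceable to specific document regions, which is the grounding property the abstract emphasizes.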