Docs2Synth: A Synthetic Data Trained Retriever Framework for Scanned Visually Rich Documents Understanding

📅 2026-01-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the poor adaptability and hallucination of multimodal models applied to scanned visually rich documents in regulated domains, where human annotations are scarce and domain knowledge is hard to keep up to date. To overcome these challenges, the authors propose a retrieval-guided reasoning framework based on synthetic supervision. The approach is annotation-free: an agent system automatically generates and verifies question-answer pairs, which are then used to train a lightweight visual retriever. At inference time, this retriever works with a multimodal large language model through an iterative retrieve-and-generate loop. Evaluated on multiple Visually Rich Document Understanding (VRDU) benchmarks, the method is reported to substantially improve grounding and domain generalization while reducing hallucination, and it supports plug-and-play deployment without manual labeling.

📝 Abstract

Visually Rich Document Understanding (VRDU) in regulated domains is particularly challenging, since scanned documents often contain sensitive, evolving, and domain-specific knowledge. This leads to two major challenges: the lack of manual annotations for model adaptation and the difficulty for pretrained models to stay up to date with domain-specific facts. While Multimodal Large Language Models (MLLMs) show strong zero-shot abilities, they still suffer from hallucination and limited domain grounding. In contrast, discriminative Vision-Language Pre-trained Models (VLPMs) provide reliable grounding but require costly annotations to cover new domains. We introduce Docs2Synth, a synthetic-supervision framework that enables retrieval-guided inference for private and low-resource domains. Docs2Synth automatically processes raw document collections, generates and verifies diverse QA pairs via an agent-based system, and trains a lightweight visual retriever to extract domain-relevant evidence. During inference, the retriever collaborates with an MLLM through an iterative retrieval-generation loop, reducing hallucination and improving response consistency. We further deliver Docs2Synth as an easy-to-use Python package, enabling plug-and-play deployment across diverse real-world scenarios. Experiments on multiple VRDU benchmarks show that Docs2Synth substantially enhances grounding and domain generalization without requiring human annotations.
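
The iterative retrieval-generation loop described in the abstract can be illustrated with a minimal sketch. This is not the Docs2Synth package API: every name here (`overlap_score`, `retrieve`, `retrieve_and_generate`, `toy_generate`) is hypothetical, the token-overlap scorer is a stand-in for the trained visual retriever, and the generator callable is a stand-in for an MLLM. The sketch only assumes that the loop adds retrieved evidence round by round until the generator is able to answer.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def overlap_score(query: str, passage: str) -> float:
    """Toy relevance score: fraction of query tokens found in the passage.

    Assumption: stands in for a learned retriever's similarity score.
    """
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


def retrieve(query: str, passages: Sequence[str], k: int = 1,
             exclude: Sequence[str] = ()) -> List[str]:
    """Return the k highest-scoring passages that were not retrieved before."""
    candidates = [p for p in passages if p not in exclude]
    ranked = sorted(candidates, key=lambda p: overlap_score(query, p), reverse=True)
    return ranked[:k]


def retrieve_and_generate(
    question: str,
    passages: Sequence[str],
    generate: Callable[[str, List[str]], Optional[str]],
    max_rounds: int = 3,
) -> Tuple[Optional[str], List[str]]:
    """Alternate retrieval and generation until an answer is produced.

    `generate` returns None when the evidence so far is insufficient
    (hypothetical convention for this sketch).
    """
    evidence: List[str] = []
    for _ in range(max_rounds):
        new = retrieve(question, passages, k=1, exclude=evidence)
        if not new:  # nothing left to retrieve
            break
        evidence.extend(new)
        answer = generate(question, evidence)
        if answer is not None:
            return answer, evidence
    return None, evidence


def toy_generate(question: str, evidence: List[str]) -> Optional[str]:
    """Stand-in for an MLLM: answers only when a passage mentions 'total'."""
    for passage in evidence:
        if "total" in passage.lower():
            return passage
    return None
```

Calling `retrieve_and_generate("what is the total amount due", passages, toy_generate)` on a small list of text passages returns the answer together with the evidence that grounded it, which is the property the abstract credits with reducing hallucination.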

Problem

Research questions and friction points this paper is trying to address.

Visually Rich Document Understanding

Synthetic Data

Domain Adaptation

Hallucination

Annotation Scarcity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic Data

Retrieval-Augmented Generation

Vision-Language Models