DOGE: Towards Versatile Visual Document Grounding and Referring

📅 2024-11-26
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) face bottlenecks in fine-grained visual document localization and referring expression understanding, primarily due to the scarcity of high-quality fine-grained training data and systematic evaluation benchmarks. To address this, we propose DOGE-Engine—a novel document-level multi-granularity parsing and instruction-tuning data synthesis framework—and DOGE-Bench, the first fine-grained document grounding benchmark covering seven tasks across charts, posters, and PDFs. Technically, our approach integrates multi-granularity OCR parsing, instruction-driven synthetic data generation, vision-language joint alignment, and structure-aware annotation. We further release the first MLLM baseline explicitly designed for document visual referring. Evaluated on DOGE-Bench, our method comprehensively outperforms state-of-the-art approaches, achieving significant gains in text localization accuracy and cross-modal referring capability. All code, data, and models are publicly released.

📝 Abstract
In recent years, Multimodal Large Language Models (MLLMs) have increasingly emphasized grounding and referring capabilities to achieve detailed understanding and flexible user interaction. However, in the realm of visual document understanding, these capabilities lag behind due to the scarcity of fine-grained datasets and comprehensive benchmarks. To fill this gap, we propose the DOcument Grounding and rEferring data engine (DOGE-Engine), which produces two types of high-quality fine-grained document data: multi-granular parsing data that strengthens fundamental text localization and recognition capabilities, and instruction-tuning data that activates MLLMs' grounding and referring capabilities during dialogue and reasoning. Additionally, using our engine, we construct DOGE-Bench, which encompasses 7 grounding and referring tasks across 3 document types (chart, poster, PDF document), providing comprehensive evaluation of fine-grained document understanding. Furthermore, leveraging the data generated by our engine, we develop a strong baseline model, DOGE. This pioneering MLLM can accurately refer to and ground text at multiple granularities within document images. Our code, data, and model will be open-sourced for community development.
Problem

Research questions and friction points this paper is trying to address.

Lack of fine-grained datasets for document grounding and referring
Underdeveloped grounding capabilities in visual document understanding
Need for comprehensive benchmarks in multimodal document analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates multi-granular parsing data
Creates instruction-tuning data for MLLMs
Develops the DOGE baseline model for precise grounding and referring