Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding

📅 2025-05-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Visual document understanding is hampered by weak multimodal fusion, poor modeling of spatial relationships, and hallucinations caused by insufficient context. To address these issues, this work proposes an adaptive multi-format markup language generation paradigm—supporting Markdown, JSON, HTML, and TikZ—and constructs two fine-grained datasets: DocMark-Pile (3.8M samples) and DocMark-Instruct (624K samples). The authors design a vision–text joint structured modeling mechanism and introduce a context-aware instruction-tuning strategy. The approach significantly enhances models' spatial-semantic comprehension and reasoning over complex layouts, achieving state-of-the-art performance across multiple visual document understanding benchmarks and surpassing existing multimodal large language models. Notably, it is the first method to enable controllable and interpretable document structure parsing driven by cross-format markup languages.
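To make the adaptive-format idea concrete, here is a minimal sketch, not the paper's released code: the document-type-to-format mapping, prompt wording, and helper name below are illustrative assumptions. The intuition is that each document type is routed to the markup that best preserves its structure, and the model is asked to parse before answering.

```python
# Hypothetical sketch of adaptive markup selection. The mapping,
# prompt text, and function name are assumptions for illustration,
# not DocMark's actual implementation.

# Which markup best preserves each document type's structure,
# using the four formats named in the paper.
FORMAT_BY_DOC_TYPE = {
    "plain_document": "Markdown",  # headings, paragraphs, lists
    "table": "HTML",               # rows, columns, merged cells
    "chart": "JSON",               # series names, axes, data points
    "diagram": "TikZ",             # geometric primitives and layout
}

def build_parsing_instruction(doc_type: str, question: str) -> str:
    """Compose a two-stage instruction: parse to markup, then answer."""
    markup = FORMAT_BY_DOC_TYPE.get(doc_type, "Markdown")
    return (
        f"First transcribe the image into well-formed {markup}, "
        "preserving reading order and spatial layout. Then answer the "
        f"question using only that representation: {question}"
    )

print(build_parsing_instruction("chart", "Which year had the peak value?"))
```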

📝 Abstract
Visual Document Understanding has become essential with the increase of text-rich visual content. This field poses significant challenges due to the need for effective integration of visual perception and textual comprehension, particularly across diverse document types with complex layouts. Moreover, existing fine-tuning datasets for this domain often fall short in providing the detailed contextual information needed for robust understanding, leading to hallucinations and limited comprehension of spatial relationships among visual elements. To address these challenges, we propose an innovative pipeline that utilizes adaptive generation of markup languages, such as Markdown, JSON, HTML, and TikZ, to build highly structured document representations and deliver contextually-grounded responses. We introduce two fine-grained structured datasets: DocMark-Pile, comprising approximately 3.8M pretraining data pairs for document parsing, and DocMark-Instruct, featuring 624k fine-tuning data annotations for grounded instruction following. Extensive experiments demonstrate that our proposed model significantly outperforms existing state-of-the-art MLLMs across a range of visual document understanding benchmarks, facilitating advanced reasoning and comprehension capabilities in complex visual scenarios. Our code and models are released at https://github.com/Euphoria16/DocMark.
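For intuition only, a contextually-grounded fine-tuning record of the kind the abstract describes might pair a question and answer with the intermediate markup that grounds the answer. The field names and values below are assumptions, not the published DocMark-Instruct schema.

```python
# Hypothetical shape of a grounded instruction-tuning sample: the
# structured markup acts as explicit evidence for the answer. Field
# names and contents are assumed, not the released schema.
sample = {
    "image": "chart_0042.png",
    "markup_format": "JSON",
    "markup": '{"title": "Quarterly revenue", '
              '"series": {"Q1": 1.2, "Q2": 1.8, "Q3": 2.4, "Q4": 2.1}}',
    "question": "In which quarter was revenue highest?",
    "answer": "Q3, which reaches 2.4 in the parsed chart data.",
}

print(sample["answer"])
```

Tying each answer to an explicit parse of the page is what the abstract credits with reducing hallucinations: the model must respond from structure it has actually extracted rather than from unsupported guesses.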
Problem

Research questions and friction points this paper is trying to address.

Integrating visual perception with textual comprehension across diverse document types
Addressing the lack of detailed contextual information in existing fine-tuning datasets
Improving comprehension of spatial relationships among visual document elements
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive generation of markup languages (Markdown, JSON, HTML, TikZ) for structured document parsing
Fine-grained structured datasets (DocMark-Pile, DocMark-Instruct) for pretraining and instruction tuning
State-of-the-art results across visual document understanding benchmarks, surpassing existing MLLMs
Authors

Han Xiao (CUHK MMLab, vivo AI Lab)
Yina Xie (vivo AI Lab)
Guanxin Tan (vivo AI Lab)
Yinghao Chen (vivo AI Lab)
Rui Hu (vivo AI Lab)
Ke Wang (CUHK MMLab)
Aojun Zhou (The Chinese University of Hong Kong)
Hao Li (CUHK MMLab)
Hao Shao (CUHK MMLab)
Xudong Lu (The Chinese University of Hong Kong)
Peng Gao (Shanghai AI Lab & Shenzhen Institute of Advanced Technology, CAS)
Yafei Wen (vivo AI Lab)
Xiaoxin Chen (vivo AI Lab)
Shuai Ren (vivo AI Lab)
Hongsheng Li (CUHK MMLab, CPII under InnoHK)