Towards Visual Text Grounding of Multimodal Large Language Model

📅 2025-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) exhibit weak visual grounding of textual elements in text-dense images such as documents, tables, and infographics, which hinders precise spatial understanding. Method: This work introduces Text-Rich Image Grounding (TRIG), a novel task for fine-grained visual-text localization in such images, together with the first dedicated benchmark and a corresponding training dataset. Using an OCR-LLM-human collaborative annotation pipeline, the authors produce 800 high-quality, manually annotated question-answer pairs and 90K synthetic training samples. They further propose a lightweight instruction-tuning strategy and a plug-and-play spatial-aware embedding module that jointly align textual content with layout coordinates. Contribution/Results: Experiments show substantial improvements in MLLMs' spatial reasoning and fine-grained text-localization accuracy on document images. The TRIG benchmark provides a reproducible evaluation framework and a principled technical pathway for advancing text-rich visual understanding.

📝 Abstract
Despite the ongoing evolution of Multimodal Large Language Models (MLLMs), a non-negligible limitation remains in their struggle with visual text grounding, especially in text-rich document images. Document images, such as scanned forms and infographics, pose critical challenges due to their complex layouts and dense textual content. However, current benchmarks do not fully address these challenges, as they mostly focus on visual grounding in natural images rather than text-rich document images. To bridge this gap, we introduce TRIG, a novel task with a newly designed instruction dataset for benchmarking and improving the Text-Rich Image Grounding capabilities of MLLMs in document question-answering. Specifically, we propose an OCR-LLM-human interaction pipeline to create 800 manually annotated question-answer pairs as a benchmark and a large-scale training set of 90K synthetic samples drawn from four diverse datasets. A comprehensive evaluation of various MLLMs on our proposed benchmark exposes substantial limitations in their grounding capability on text-rich images. In addition, we propose two simple and effective TRIG methods based on general instruction tuning and plug-and-play efficient embedding, respectively. Finetuning MLLMs on our synthetic dataset promisingly improves their spatial reasoning and grounding capabilities.
Problem

Research questions and friction points this paper is trying to address.

MLLMs struggle with visual text grounding in text-rich document images
Current benchmarks lack focus on text-rich document image challenges
Proposes TRIG task to improve MLLMs' text-rich image grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

OCR-LLM-human pipeline for annotation
Synthetic dataset for training MLLMs
Plug-and-play embedding for text grounding
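A benchmark for text grounding like TRIG implies scoring how well a model's predicted text region matches the annotated one. The paper does not specify its metric here, so the sketch below is an illustrative assumption: a standard intersection-over-union (IoU) match with a hypothetical 0.5 threshold, which is how grounding accuracy is commonly computed.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def grounding_accuracy(predicted, annotated, threshold=0.5):
    """Fraction of predicted boxes that match the annotation at the IoU threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(predicted, annotated))
    return hits / len(annotated)
```

The threshold and function names are assumptions for illustration, not the paper's evaluation protocol.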
👥 Authors
Ming Li
University of Maryland
Ruiyi Zhang
Adobe Research
Jian Chen
University at Buffalo
Jiuxiang Gu
Adobe Research
Computer Vision · Natural Language Processing · Machine Learning
Yufan Zhou
Adobe Research
Franck Dernoncourt
NLP/ML Researcher. MIT PhD.
Machine Learning · Neural Networks · Natural Language Processing
Wanrong Zhu
Adobe Research
Vision and Language · Natural Language Processing
Tianyi Zhou
University of Maryland
Tong Sun
Adobe Research