TextSquare: Scaling up Text-Centric Visual Instruction Tuning

πŸ“… 2024-04-19
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 26
✨ Influential: 2
πŸ€– AI Summary
Open-source multimodal large language models (MLLMs) significantly underperform proprietary models (e.g., GPT-4V, Gemini) on text-centric visual question answering (VQA), primarily due to a scarcity of high-quality instruction-tuning data. Method: We propose Square, a novel data construction paradigm leveraging closed-source MLLMs to synthesize Square-10Mβ€”a 10-million-sample, high-fidelity text-centric visual instruction datasetβ€”via a four-stage pipeline: self-questioning, answer generation, structured reasoning chain construction, and multi-dimensional hallucination assessment. Contribution/Results: We empirically discover an exponential scaling law linking text-centric VQA performance to instruction-data scale. We further demonstrate that explicit reasoning chains critically mitigate hallucination and enhance contextual understanding. The resulting TextSquare model achieves 62.2% on OCRBench and outperforms GPT-4V/Gemini on 6 out of 10 text-centric benchmarks. Its average score on general VQA and hallucination evaluation is 75.1%, substantially surpassing prior open-source state-of-the-art.
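The four-stage Square pipeline described above could be sketched roughly as follows. This is a hedged illustration, not the paper's implementation: `ask_mllm` is a hypothetical stand-in for a call to the closed-source MLLM API, and the prompts and `VQASample` structure are invented for clarity.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class VQASample:
    question: str
    answer: str
    reasoning: str  # explicit reasoning chain, kept as training signal

def square_pipeline(image_desc: str, ask_mllm: Callable[[str], str]) -> list[VQASample]:
    """Sketch of Square: Self-Questioning, Answering, Reasoning, Evaluation."""
    # Stage 1: Self-Questioning -- the MLLM proposes text-centric questions
    # about the image (represented here by a text description).
    questions = ask_mllm(f"List text-centric questions about: {image_desc}").splitlines()

    kept = []
    for q in questions:
        # Stage 2: Answering -- the MLLM answers its own question.
        a = ask_mllm(f"Answer concisely: {q}")
        # Stage 3: Reasoning -- construct an explicit reasoning chain.
        r = ask_mllm(f"Explain step by step why '{a}' answers '{q}'")
        # Stage 4: Evaluation -- discard samples the model flags as
        # hallucinated or unsupported by the image.
        if ask_mllm(f"Is '{a}' supported by the image and '{r}'? yes/no") == "yes":
            kept.append(VQASample(q, a, r))
    return kept
```

In this sketch the same model plays all four roles; filtering at the last stage is what keeps the synthesized dataset high-fidelity despite being fully model-generated.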

πŸ“ Abstract
Text-centric visual question answering (VQA) has made great strides with the development of Multimodal Large Language Models (MLLMs), yet open-source models still fall short of leading models like GPT4V and Gemini, partly due to a lack of extensive, high-quality instruction-tuning data. To this end, we introduce a new approach for creating a massive, high-quality instruction-tuning dataset, Square-10M, which is generated using closed-source MLLMs. The data construction process, termed Square, consists of four steps: Self-Questioning, Answering, Reasoning, and Evaluation. Our experiments with Square-10M led to three key findings: 1) Our model, TextSquare, considerably surpasses the previous open-source state-of-the-art text-centric MLLMs and sets a new standard on OCRBench (62.2%). It even outperforms top-tier models like GPT4V and Gemini in 6 of 10 text-centric benchmarks. 2) We demonstrate the critical role of VQA reasoning data in offering comprehensive contextual insights for specific questions; this not only improves accuracy but also significantly mitigates hallucinations. Specifically, TextSquare scores an average of 75.1% across four general VQA and hallucination evaluation datasets, outperforming previous state-of-the-art models. 3) Notably, scaling text-centric VQA datasets reveals a clear pattern: model performance improves in proportion to exponential increases in instruction-tuning data volume, validating both the necessity of dataset scale and the high quality of Square-10M.
Problem

Research questions and friction points this paper is trying to address.

Addressing lack of high-quality text-centric VQA instruction data
Improving multimodal models' text understanding and reasoning capabilities
Scaling instruction data to enhance performance and reduce hallucinations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generated massive dataset using closed-source MLLMs
Four-step data construction: Questioning, Answering, Reasoning, Evaluation
Performance improves in proportion to exponential growth in instruction-tuning data volume
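The scaling observation above means the benchmark score grows roughly linearly in the logarithm of the data volume: multiplying the dataset by 10x yields a similar absolute gain each time. A minimal sketch of fitting such a log-linear trend, using made-up illustrative numbers (not the paper's measurements):

```python
import math

# Illustrative (invented) points: (instruction samples, benchmark score %).
# The claimed pattern: score is approximately linear in log10(data volume).
data = [(1e5, 48.0), (1e6, 55.0), (1e7, 62.0)]

# Least-squares fit of score = a * log10(n) + b, done by hand.
xs = [math.log10(n) for n, _ in data]
ys = [s for _, s in data]
mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx

# Extrapolate: the fitted score at 10x more data (1e8 samples).
predicted = a * math.log10(1e8) + b
```

Under this toy fit, each 10x increase in data adds the slope `a` points of score, which is exactly the "exponential data, linear gains" pattern the paper reports for Square-10M.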
Jingqun Tang
ByteDance Inc.
Computer Vision, Document Intelligence, MLLM, Multimodal Generative Models
Chunhui Lin
ByteDance
Zhen Zhao
East China Normal University
Shubo Wei
ByteDance
Binghong Wu
ByteDance
Qi Liu
ByteDance
Hao Feng
ByteDance
Yang Li
ByteDance
Siqi Wang
ByteDance
Lei Liao
ByteDance Inc.
Wei Shi
ByteDance
Yuliang Liu
Huazhong University of Science and Technology
Hao Liu
ByteDance
Yuan Xie
East China Normal University
Xiang Bai
Huazhong University of Science and Technology (HUST)
Computer Vision, OCR
Can Huang
ByteDance