🤖 AI Summary
Open-source multimodal large language models (MLLMs) significantly underperform proprietary models (e.g., GPT-4V, Gemini) on text-centric visual question answering (VQA), primarily due to a scarcity of high-quality instruction-tuning data.
Method: We propose Square, a novel data construction paradigm leveraging closed-source MLLMs to synthesize Square-10M, a 10-million-sample, high-fidelity text-centric visual instruction dataset, via a four-stage pipeline: self-questioning, answer generation, structured reasoning chain construction, and multi-dimensional hallucination assessment.
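The four-stage pipeline above can be sketched as a simple loop over images. This is a minimal illustration, not the paper's implementation: `query_mllm`, the `Sample` fields, and the stage prompts are all hypothetical placeholders standing in for calls to a closed-source MLLM.

```python
from dataclasses import dataclass

# Hypothetical stand-in for a closed-source MLLM call (e.g., a vendor API);
# here it just echoes the prompt so the sketch runs without network access.
def query_mllm(image: str, prompt: str) -> str:
    return f"[model output for: {prompt}]"

@dataclass
class Sample:
    image: str          # path or handle to a text-rich image
    question: str = ""
    answer: str = ""
    reasoning: str = ""
    keep: bool = True   # set False if the evaluation stage flags hallucination

def square_pipeline(image: str) -> Sample:
    s = Sample(image=image)
    # 1) Self-Questioning: the MLLM proposes a text-centric question about the image.
    s.question = query_mllm(image, "Propose a question about the text in this image.")
    # 2) Answering: the MLLM answers its own question.
    s.answer = query_mllm(image, f"Answer: {s.question}")
    # 3) Reasoning: the MLLM produces an explicit reasoning chain for the answer.
    s.reasoning = query_mllm(image, f"Explain step by step why the answer is: {s.answer}")
    # 4) Evaluation: the MLLM checks the QA pair for grounding; flagged pairs are dropped.
    verdict = query_mllm(image, f"Is this QA pair grounded in the image? Q: {s.question} A: {s.answer}")
    s.keep = "no" not in verdict.lower()
    return s
```

In this sketch, retained samples (those with `keep=True`) would form the instruction-tuning set; the filtering in stage 4 is what keeps the dataset high-fidelity despite being fully model-generated.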
Contribution/Results: We empirically observe a scaling law for text-centric VQA: performance improves roughly linearly as instruction-data volume grows exponentially. We further demonstrate that explicit reasoning chains critically mitigate hallucination and enhance contextual understanding. The resulting TextSquare model achieves 62.2% on OCRBench and outperforms GPT-4V and Gemini on 6 of 10 text-centric benchmarks. Its average score across general VQA and hallucination evaluations is 75.1%, substantially surpassing the prior open-source state of the art.
📝 Abstract
Text-centric visual question answering (VQA) has made great strides with the development of Multimodal Large Language Models (MLLMs), yet open-source models still fall short of leading models like GPT-4V and Gemini, partly due to a lack of extensive, high-quality instruction-tuning data. To this end, we introduce a new approach for creating a massive, high-quality instruction-tuning dataset, Square-10M, which is generated using closed-source MLLMs. The data construction process, termed Square, consists of four steps: Self-Questioning, Answering, Reasoning, and Evaluation. Our experiments with Square-10M led to three key findings: 1) Our model, TextSquare, considerably surpasses previous open-source state-of-the-art text-centric MLLMs and sets a new standard on OCRBench (62.2%). It even outperforms top-tier models like GPT-4V and Gemini on 6 of 10 text-centric benchmarks. 2) We demonstrate the critical role of VQA reasoning data in providing comprehensive contextual insight for specific questions; this not only improves accuracy but also significantly mitigates hallucination. Specifically, TextSquare scores an average of 75.1% across four general VQA and hallucination evaluation datasets, outperforming previous state-of-the-art models. 3) Notably, scaling text-centric VQA datasets reveals a clear pattern: model performance improves in direct proportion to exponential increases in instruction-tuning data volume, validating both the necessity of the dataset's scale and the high quality of Square-10M.