🤖 AI Summary
Open-source multimodal large language models (MLLMs) significantly underperform proprietary models (e.g., GPT-4V, Gemini) on text-centric visual question answering (VQA), primarily due to a scarcity of high-quality instruction-tuning data.
Method: We propose Square, a novel data construction paradigm leveraging closed-source MLLMs to synthesize Square-10M, a 10-million-sample, high-fidelity text-centric visual instruction dataset, via a four-stage pipeline: self-questioning, answer generation, structured reasoning chain construction, and multi-dimensional hallucination assessment.
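The four-stage pipeline above can be sketched as a simple loop over images. This is a minimal illustration, not the paper's implementation: `query_mllm`, the `Sample` fields, and the stage prompts are all hypothetical placeholders standing in for calls to a closed-source MLLM.

```python
from dataclasses import dataclass

# Hypothetical stand-in for a closed-source MLLM call (e.g., a vendor API);
# here it just echoes the prompt so the sketch runs without network access.
def query_mllm(image: str, prompt: str) -> str:
    return f"[model output for: {prompt}]"

@dataclass
class Sample:
    image: str          # path or handle to a text-rich image
    question: str = ""
    answer: str = ""
    reasoning: str = ""
    keep: bool = True   # set False if the evaluation stage flags hallucination

def square_pipeline(image: str) -> Sample:
    s = Sample(image=image)
    # 1) Self-Questioning: the MLLM proposes a text-centric question about the image.
    s.question = query_mllm(image, "Propose a question about the text in this image.")
    # 2) Answering: the MLLM answers its own question.
    s.answer = query_mllm(image, f"Answer: {s.question}")
    # 3) Reasoning: the MLLM produces an explicit reasoning chain for the answer.
    s.reasoning = query_mllm(image, f"Explain step by step why the answer is: {s.answer}")
    # 4) Evaluation: the MLLM checks the QA pair for grounding; flagged pairs are dropped.
    verdict = query_mllm(image, f"Is this QA pair grounded in the image? Q: {s.question} A: {s.answer}")
    s.keep = "no" not in verdict.lower()
    return s
```

In this sketch, retained samples (those with `keep=True`) would form the instruction-tuning set; the filtering in stage 4 is what keeps the dataset high-fidelity despite being fully model-generated.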
Contribution/Results: We empirically observe a scaling law for text-centric VQA: performance improves roughly linearly as instruction-data volume grows exponentially. We further demonstrate that explicit reasoning chains critically mitigate hallucination and enhance contextual understanding. The resulting TextSquare model achieves 62.2% on OCRBench and outperforms GPT-4V and Gemini on 6 of 10 text-centric benchmarks. Its average score across general VQA and hallucination evaluations is 75.1%, substantially surpassing the prior open-source state of the art.
📝 Abstract
Text-centric visual question answering (VQA) has made great strides with the development of Multimodal Large Language Models (MLLMs), yet open-source models still fall short of leading models like GPT-4V and Gemini, partly due to a lack of extensive, high-quality instruction-tuning data. To this end, we introduce a new approach for creating a massive, high-quality instruction-tuning dataset, Square-10M, which is generated using closed-source MLLMs. The data construction process, termed Square, consists of four steps: Self-Questioning, Answering, Reasoning, and Evaluation. Our experiments with Square-10M led to three key findings: 1) Our model, TextSquare, considerably surpasses previous open-source state-of-the-art text-centric MLLMs and sets a new standard on OCRBench (62.2%). It even outperforms top-tier models like GPT-4V and Gemini on 6 of 10 text-centric benchmarks. 2) We demonstrate the critical role of VQA reasoning data in providing comprehensive contextual insight for specific questions; this not only improves accuracy but also significantly mitigates hallucination. Specifically, TextSquare scores an average of 75.1% across four general VQA and hallucination evaluation datasets, outperforming previous state-of-the-art models. 3) Notably, scaling text-centric VQA datasets reveals a clear pattern: model performance improves in direct proportion to exponential increases in instruction-tuning data volume, validating both the necessity of the dataset's scale and the high quality of Square-10M.