🤖 AI Summary
In scientific visual question answering (VQA), existing multimodal models exhibit weak zero-shot comprehension of scientific charts and their associated textual annotations, primarily due to the absence of training data in the “text-in-image” format. To address this, we propose a lightweight and efficient data augmentation strategy that automatically synthesizes disjoint image–text pairs into unified, text-embedded images, thereby constructing a novel scientific VQA training dataset. This work is the first to systematically enable large-scale synthesis of fused image–text representations. The synthesized data supports fine-tuning of multilingual multimodal models and joint image–text representation learning. Evaluated across 13 languages, our approach significantly improves zero-shot cross-lingual transfer, yielding an average accuracy gain of 4.2%. It establishes a scalable, low-resource paradigm for scientific chart understanding.
📝 Abstract
Scientific visual question answering poses significant challenges for vision-language models due to the complexity of scientific figures and their multimodal context. Traditional approaches treat the figure and accompanying text (e.g., questions and answer options) as separate inputs. EXAMS-V introduced a new paradigm by embedding both visual and textual content into a single image. However, even state-of-the-art proprietary models perform poorly on this setup in zero-shot settings, underscoring the need for task-specific fine-tuning. To address the scarcity of training data in this "text-in-image" format, we synthesize a new dataset by converting existing separate image-text pairs into unified images. Fine-tuning a small multilingual multimodal model on a mix of our synthetic data and EXAMS-V yields notable gains across 13 languages, demonstrating strong average improvements and cross-lingual transfer.
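The conversion of separate image–text pairs into unified "text-in-image" examples can be sketched with a simple rendering step: paste the figure onto a white canvas and draw the question and answer options beneath it. The sketch below is illustrative only; the paper's actual rendering pipeline (fonts, layout, languages) is not specified here, and the function name `embed_text_in_image` is hypothetical.

```python
from PIL import Image, ImageDraw, ImageFont

def embed_text_in_image(figure, question, options, pad=16, line_h=18):
    """Render a question and its answer options beneath a figure,
    producing a single 'text-in-image' training example.
    Illustrative sketch; the actual synthesis pipeline may differ."""
    # One line for the question, one per lettered option: (A), (B), ...
    lines = [question] + [f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options)]
    text_h = pad + line_h * len(lines) + pad
    # White canvas tall enough for the figure plus the rendered text block.
    canvas = Image.new("RGB", (figure.width, figure.height + text_h), "white")
    canvas.paste(figure, (0, 0))
    draw = ImageDraw.Draw(canvas)
    font = ImageFont.load_default()
    y = figure.height + pad
    for line in lines:
        draw.text((pad, y), line, fill="black", font=font)
        y += line_h
    return canvas

# Usage: fuse a placeholder figure with a question and three options.
fig = Image.new("RGB", (320, 200), "lightgray")
fused = embed_text_in_image(
    fig,
    "Which trend does the chart show?",
    ["Increasing", "Decreasing", "Flat"],
)
```

In practice, such a pipeline would also need to handle multilingual fonts and text wrapping, since the dataset spans 13 languages.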