KRETA: A Benchmark for Korean Reading and Reasoning in Text-Rich VQA Attuned to Diverse Visual Contexts

📅 2025-08-27
🤖 AI Summary
A lack of comprehensive, text-rich visual question answering (VQA) benchmarks hinders evaluation of vision-language models (VLMs) for low-resource languages such as Korean. Method: We introduce KRETA, the first fine-grained Korean VQA benchmark, covering 15 domains and 26 image types to systematically assess Korean text comprehension and multi-step reasoning in complex visual scenes. Our scalable semi-automated pipeline integrates progressive image decomposition, OCR-enhanced text–vision alignment, large language model–driven question generation and filtering, a seven-dimensional quality control framework, and multi-round human verification. Contribution/Results: KRETA is the largest open-source Korean VQA dataset to date. Its code and data are publicly released on GitHub, substantially bridging the multilingual multimodal evaluation gap and enabling fair, cross-lingual VLM assessment and sustained advancement.
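The semi-automated pipeline described above can be sketched as a staged filter: decompose each image into text-bearing regions, generate a candidate question per region, score it on seven quality dimensions, and keep only pairs that clear every metric. The sketch below is illustrative only; all function names, the region/QA data structures, and the specific metric names are assumptions for this example, not the paper's actual implementation (the real stages use layout analysis, OCR, and LLM judges, which are stubbed out here).

```python
from dataclasses import dataclass, field

@dataclass
class Region:
    label: str
    text: str  # OCR-extracted text aligned to this region (stubbed)

@dataclass
class QAPair:
    question: str
    answer: str
    scores: dict = field(default_factory=dict)

def decompose_image(image_id: str) -> list[Region]:
    # Stage 1 (stub): progressively split the image into text-bearing
    # regions; a real pipeline would run layout analysis plus OCR here.
    return [Region("title", "KRETA 벤치마크"),
            Region("body", "15개 도메인, 26개 이미지 유형")]

def generate_qa(region: Region) -> QAPair:
    # Stage 2 (stub): an LLM would draft a question grounded in the
    # aligned OCR text; we use a fixed template for illustration.
    return QAPair(question=f"What does the {region.label} region say?",
                  answer=region.text)

# Hypothetical metric names — the paper specifies a seven-dimensional
# protocol but these particular dimensions are placeholders.
QUALITY_METRICS = ["fluency", "answerability", "groundedness",
                   "difficulty", "diversity", "relevance", "correctness"]

def score_qa(qa: QAPair) -> QAPair:
    # Stage 3 (stub): each metric would be judged by an LLM and then
    # human verifiers; here every pair receives a fixed passing score.
    qa.scores = {m: 1.0 for m in QUALITY_METRICS}
    return qa

def passes_quality_control(qa: QAPair, threshold: float = 0.5) -> bool:
    # A pair survives only if all seven metrics clear the threshold.
    return all(s >= threshold for s in qa.scores.values())

def build_benchmark(image_ids: list[str]) -> list[QAPair]:
    kept = []
    for image_id in image_ids:
        for region in decompose_image(image_id):
            qa = score_qa(generate_qa(region))
            if passes_quality_control(qa):
                kept.append(qa)
    return kept
```

The key design point this sketch captures is that quality control is a conjunctive gate: failing any one of the seven dimensions drops the candidate, which is what makes multi-round human verification tractable downstream.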

📝 Abstract
Understanding and reasoning over text within visual contexts poses a significant challenge for Vision-Language Models (VLMs), given the complexity and diversity of real-world scenarios. To address this challenge, text-rich Visual Question Answering (VQA) datasets and benchmarks have emerged for high-resource languages like English. However, a critical gap persists for low-resource languages such as Korean, where the lack of comprehensive benchmarks hinders robust model evaluation and comparison. To bridge this gap, we introduce KRETA, a benchmark for Korean Reading and rEasoning in Text-rich VQA Attuned to diverse visual contexts. KRETA facilitates an in-depth evaluation of both visual text understanding and reasoning capabilities, while also supporting a multifaceted assessment across 15 domains and 26 image types. Additionally, we introduce a semi-automated VQA generation pipeline specifically optimized for text-rich settings, leveraging refined stepwise image decomposition and a rigorous seven-metric evaluation protocol to ensure data quality. While KRETA is tailored for Korean, we hope our adaptable and extensible pipeline will facilitate the development of similar benchmarks in other languages, thereby accelerating multilingual VLM research. The code and dataset for KRETA are available at https://github.com/tabtoyou/KRETA.
Problem

Research questions and friction points this paper is trying to address.

Addresses lack of Korean text-rich VQA benchmarks
Evaluates visual text understanding and reasoning capabilities
Provides multifaceted evaluation across 15 domains and 26 image types
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semi-automated VQA generation pipeline
Stepwise image decomposition method
Seven-metric evaluation protocol