Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration

📅 2024-06-24
🏛️ arXiv.org
📈 Citations: 6
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing cultural evaluation benchmarks suffer from heavy reliance on manual curation, narrow cultural coverage, and low construction efficiency. To address these limitations, this paper introduces K-Viscuit, the first vision-language multiple-choice question benchmark explicitly designed for Korean culture. Methodologically, we propose a novel semi-automated human-in-the-loop framework: vision-language model (VLM) prompting with few-shot guided generation produces culturally grounded questions, which are then validated by native Korean speakers, augmented with external knowledge, and assessed under a multi-dimensional human evaluation protocol. Our contributions are threefold: (1) the open-sourced K-Viscuit dataset, publicly available on Hugging Face; (2) the first systematic empirical finding that mainstream open-source VLMs significantly underperform closed-source counterparts in East Asian cultural understanding; and (3) a flexible, transferable evaluation paradigm supporting both multiple-choice and open-ended cultural comprehension assessment.

πŸ“ Abstract
To create culturally inclusive vision-language models (VLMs), developing a benchmark that tests their ability to address culturally relevant questions is essential. Existing approaches typically rely on human annotators, making the process labor-intensive and creating a cognitive burden in generating diverse questions. To address this, we propose a semi-automated framework for constructing cultural VLM benchmarks, specifically targeting multiple-choice QA. This framework combines human-VLM collaboration, where VLMs generate questions based on guidelines, a small set of annotated examples, and relevant knowledge, followed by a verification process by native speakers. We demonstrate the effectiveness of this framework through the creation of K-Viscuit, a dataset focused on Korean culture. Our experiments on this dataset reveal that open-source models lag behind proprietary ones in understanding Korean culture, highlighting key areas for improvement. We also present a series of further analyses, including human evaluation, augmenting VLMs with external knowledge, and evaluation beyond multiple-choice QA. Our dataset is available at https://huggingface.co/datasets/ddehun/k-viscuit.
Problem

Research questions and friction points this paper is trying to address.

Developing culturally inclusive vision-language models (VLMs)
Reducing labor-intensive human annotation for diverse questions
Assessing VLMs' understanding of Korean culture via K-Viscuit
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semi-automated framework for cultural VLM benchmarks
Human-VLM collaboration for question generation
Native speaker verification for cultural accuracy
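The innovations above can be sketched as a two-stage pipeline: a VLM drafts questions from guidelines, a few annotated examples, and relevant knowledge, and a native speaker accepts or rejects each draft. The sketch below is a minimal illustration under stated assumptions: the function names, the prompt layout, and the injected `vlm` and `verify` callables are hypothetical stand-ins, not the authors' actual implementation.

```python
# Minimal sketch of the human-VLM collaboration loop for benchmark
# construction. All names here (generate_mcq_prompt, build_benchmark,
# the `vlm` and `verify` callables) are illustrative assumptions.

def generate_mcq_prompt(guidelines, examples, knowledge, image_caption):
    """Assemble a few-shot prompt asking a VLM to draft one multiple-choice
    question grounded in the supplied cultural knowledge."""
    shots = "\n\n".join(
        f"Image: {ex['caption']}\nQ: {ex['question']}\n"
        f"Choices: {ex['choices']}\nAnswer: {ex['answer']}"
        for ex in examples
    )
    return (
        f"{guidelines}\n\n{shots}\n\n"
        f"Relevant knowledge: {knowledge}\n"
        f"Image: {image_caption}\nQ:"
    )

def build_benchmark(items, vlm, verify):
    """Draft a question per item with a VLM, then keep only drafts that a
    native-speaker verifier accepts (the human-in-the-loop step)."""
    dataset = []
    for item in items:
        prompt = generate_mcq_prompt(
            item["guidelines"], item["examples"],
            item["knowledge"], item["caption"],
        )
        draft = vlm(prompt)   # e.g. a call to a proprietary VLM API
        if verify(draft):     # native Korean speaker validation
            dataset.append(draft)
    return dataset
```

In practice `vlm` would wrap an API call with the image attached and `verify` would route drafts to human annotators; both are left as injected callables so the loop itself stays model-agnostic.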