DEJIMA: A Novel Large-scale Japanese Dataset for Image Captioning and Visual Question Answering

📅 2025-11-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the longstanding scarcity of high-quality, large-scale resources for Japanese vision-and-language (V&L) modeling, this work introduces two large Japanese multimodal datasets, DEJIMA-Cap and DEJIMA-VQA, each containing 3.88 million image-text pairs. We propose a scalable, reproducible end-to-end construction pipeline: (1) collecting raw image-text pairs via web crawling with rigorous deduplication; (2) applying object detection models to localize visual evidence for grounded text generation; and (3) employing grounding-constrained large language models to improve cultural appropriateness and linguistic naturalness. Experiments demonstrate that our datasets substantially surpass translation- and human-annotation-based baselines in Japanese fluency, cultural representativeness, and cross-modal alignment quality, yielding consistent performance gains across multiple Japanese multimodal benchmarks. The full datasets are publicly released under a commercial-use license.
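The three summarized stages map naturally onto a small data-flow sketch. The Python below is illustrative only: `detect_objects` and `generate_grounded_caption` are hypothetical placeholders for the paper's (unnamed) detector and LLM modules, and the exact-hash deduplication shown is a simple stand-in for the paper's more rigorous filtering.

```python
# Minimal sketch of a DEJIMA-style three-stage construction pipeline.
# detect_objects() and generate_grounded_caption() are hypothetical
# placeholders, not the paper's actual modules.
import hashlib
from dataclasses import dataclass, field


@dataclass
class Sample:
    image_bytes: bytes
    alt_text: str                      # raw web text paired with the image
    evidence: list[str] = field(default_factory=list)
    caption: str = ""


def dedup(samples: list[Sample]) -> list[Sample]:
    """Stage 1: web collection + deduplication (exact-hash only here;
    the paper's filtering is presumably stricter, e.g. near-duplicates)."""
    seen: set[str] = set()
    unique: list[Sample] = []
    for s in samples:
        key = hashlib.sha256(s.image_bytes).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(s)
    return unique


def detect_objects(image_bytes: bytes) -> list[str]:
    """Stage 2 placeholder: run an object detector and return detected
    labels, which serve as visual evidence for grounding."""
    return []  # swap in a real detector in practice


def generate_grounded_caption(alt_text: str, evidence: list[str]) -> str:
    """Stage 3 placeholder: prompt an LLM to write natural Japanese text
    constrained to mention only objects listed in `evidence`."""
    return alt_text  # swap in a real grounding-constrained LLM call


def build_dataset(raw: list[Sample]) -> list[Sample]:
    """Chain the three stages over the raw crawl."""
    out: list[Sample] = []
    for s in dedup(raw):
        s.evidence = detect_objects(s.image_bytes)
        s.caption = generate_grounded_caption(s.alt_text, s.evidence)
        out.append(s)
    return out
```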

📝 Abstract
This work addresses the scarcity of high-quality, large-scale resources for Japanese Vision-and-Language (V&L) modeling. We present a scalable and reproducible pipeline that integrates large-scale web collection with rigorous filtering/deduplication, object-detection-driven evidence extraction, and Large Language Model (LLM)-based refinement under grounding constraints. Using this pipeline, we build two resources: an image-caption dataset (DEJIMA-Cap) and a VQA dataset (DEJIMA-VQA), each containing 3.88M image-text pairs, far exceeding the size of existing Japanese V&L datasets. Human evaluations demonstrate that DEJIMA achieves substantially higher Japaneseness and linguistic naturalness than datasets constructed via translation or manual annotation, while maintaining factual correctness at a level comparable to human-annotated corpora. Quantitative analyses of image feature distributions further confirm that DEJIMA broadly covers diverse visual domains characteristic of Japan, complementing its linguistic and cultural representativeness. Models trained on DEJIMA exhibit consistent improvements across multiple Japanese multimodal benchmarks, confirming that culturally grounded, large-scale resources play a key role in enhancing model performance. All data sources and modules in our pipeline are licensed for commercial use, and we publicly release the resulting dataset and metadata to encourage further research and industrial applications in Japanese V&L modeling.
Problem

Research questions and friction points this paper is trying to address.

Addresses scarcity of large-scale Japanese vision-language datasets
Builds culturally representative image-caption and VQA datasets for Japan
Enhances model performance on Japanese multimodal tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scalable pipeline integrates web collection, filtering, and LLM refinement
Object-detection-driven evidence extraction enforces factual grounding (see the sketch after this list)
Datasets achieve high Japaneseness and cultural representativeness through natively generated Japanese text rather than translation
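As a concrete illustration of the grounding constraint, one crude option is a post-hoc filter that rejects generated text mentioning objects the detector never found. This is a hypothetical check, not the paper's actual procedure; `is_grounded`, its watchlist vocabulary, and the substring matching rule are all illustrative.

```python
# Hypothetical post-hoc grounding filter: accept a generated caption only
# if every watchlisted object it mentions appears in the detected evidence.
def is_grounded(caption: str, evidence: list[str],
                watchlist: list[str]) -> bool:
    """Crude surface-match heuristic over a fixed object vocabulary."""
    allowed = {e.lower() for e in evidence}
    mentioned = {w for w in watchlist if w.lower() in caption.lower()}
    return all(w.lower() in allowed for w in mentioned)


# Example: "dog" is mentioned but was never detected -> rejected.
assert not is_grounded("A dog sits by a torii gate.",
                       evidence=["torii gate", "shrine"],
                       watchlist=["dog", "cat", "torii gate"])
```

In practice such a filter would presumably operate on morphologically analyzed Japanese text rather than raw substring matches, but the principle is the same: generation is accepted only when it stays within the detector's evidence.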