🤖 AI Summary
Existing cultural question-answering benchmarks are largely confined to single-hop questions, which limits how well they can evaluate large language models’ capacity for deep reasoning about complex cultural contexts. This work proposes ID-MoCQA, the first large-scale multi-hop question-answering dataset for assessing cultural understanding, grounded in Indonesian culture and available in both English and Indonesian. Its framework systematically transforms single-hop questions into multi-hop reasoning chains spanning six clue types (e.g., commonsense, temporal, geographical). The dataset is validated through a multi-stage pipeline that combines expert review with LLM-as-a-judge filtering. Experiments on state-of-the-art large language models reveal substantial gaps in cultural reasoning, particularly in tasks requiring nuanced inference. ID-MoCQA thus fills a gap in the evaluation of cultural understanding and provides a challenging benchmark for assessing models’ cultural reasoning capabilities.
📝 Abstract
Understanding culture requires reasoning across context, tradition, and implicit social knowledge, far beyond recalling isolated facts. Yet most culturally focused question answering (QA) benchmarks rely on single-hop questions, which may allow models to exploit shallow cues rather than demonstrate genuine cultural reasoning. In this work, we introduce ID-MoCQA, the first large-scale multi-hop QA dataset for assessing the cultural understanding of large language models (LLMs), grounded in Indonesian traditions and available in both English and Indonesian. We present a new framework that systematically transforms single-hop cultural questions into multi-hop reasoning chains spanning six clue types (e.g., commonsense, temporal, geographical). Our multi-stage validation pipeline, combining expert review and LLM-as-a-judge filtering, ensures high-quality question-answer pairs. Our evaluation across state-of-the-art models reveals substantial gaps in cultural reasoning, particularly in tasks requiring nuanced inference. ID-MoCQA provides a challenging and essential benchmark for advancing the cultural competency of LLMs.