ProverbEval: Exploring LLM Evaluation Challenges for Low-resource Language Understanding

📅 2024-11-07
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the lack of evaluation standards for cultural understanding in low-resource languages, this paper introduces ProverbEval, the first culturally grounded benchmark for proverb understanding and generation. We construct a multilingual proverb dataset, design controlled-variable experiments, and systematically evaluate state-of-the-art multilingual large language models (MLLMs). Our empirical analysis reveals three key findings: (1) multiple-choice option ordering induces up to 50% performance variance; (2) native-language prompting significantly improves generation fidelity; and (3) monolingual evaluation substantially outperforms cross-lingual transfer on culture-sensitive tasks. We publicly release the dataset, evaluation code, and standardized pipeline (on Hugging Face and GitHub), establishing a reproducible, scalable methodological foundation and empirical evidence for assessing cultural cognition capabilities in low-resource languages.
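As a rough illustration of the option-ordering effect described in finding (1), here is a minimal sketch of how one might measure accuracy spread across answer-choice permutations. This is not the authors' released pipeline; `model_answer_fn` and the `question`/`choices`/`answer` record fields are hypothetical placeholders.

```python
import itertools
import statistics

def accuracy_under_ordering(model_answer_fn, questions, ordering):
    """Score a batch of 4-option MCQ items with the choices presented
    in one fixed permutation `ordering`, e.g. (2, 0, 3, 1)."""
    correct = 0
    for q in questions:
        choices = [q["choices"][i] for i in ordering]
        # model_answer_fn (user-supplied) returns the index of the picked option
        picked = model_answer_fn(q["question"], choices)
        if choices[picked] == q["answer"]:
            correct += 1
    return correct / len(questions)

def ordering_variance(model_answer_fn, questions):
    """Evaluate every permutation of the 4 answer slots and report the
    spread, mirroring the up-to-50% variance the paper reports."""
    scores = [
        accuracy_under_ordering(model_answer_fn, questions, p)
        for p in itertools.permutations(range(4))
    ]
    return min(scores), max(scores), statistics.mean(scores)
```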

๐Ÿ“ Abstract
With the rapid development of evaluation datasets to assess LLMs' understanding across a wide range of subjects and domains, identifying a suitable language understanding benchmark has become increasingly challenging. In this work, we explore LLM evaluation challenges for low-resource language understanding and introduce ProverbEval, an LLM evaluation benchmark for low-resource languages that focuses on language understanding in culture-specific scenarios. We benchmark various LLMs and explore factors that create variability in the benchmarking process. We observed performance variances of up to 50%, depending on the order in which answer choices were presented in multiple-choice tasks. Native-language proverb descriptions significantly improve tasks such as proverb generation. Additionally, monolingual evaluations consistently outperformed their cross-lingual counterparts in generation tasks. We argue that special attention must be given to the order of choices, the choice of prompt language, task variability, and generation tasks when creating LLM evaluation benchmarks. Evaluation data is available at https://huggingface.co/datasets/israel/ProverbEval and evaluation code at https://github.com/EthioNLP/EthioProverbEval.
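The dataset linked in the abstract can be inspected with the Hugging Face `datasets` library. A minimal sketch follows; the repo id is taken from the abstract, but whether a configuration name is required, and what the splits and fields are called, are assumptions to verify against the dataset card.

```python
from datasets import load_dataset

# Repo id from the abstract; a config argument may be needed — check
# https://huggingface.co/datasets/israel/ProverbEval for the actual layout.
ds = load_dataset("israel/ProverbEval")
print(ds)                                   # list the available splits
first_split = list(ds.keys())[0]
print(next(iter(ds[first_split])))          # peek at one example record
```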
Problem

Research questions and friction points this paper addresses.

LLM evaluation challenges
low-resource language understanding
culture-specific scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM evaluation benchmark
Low-resource language understanding
Native language proverb descriptions