Kalahi: A handcrafted, grassroots cultural LLM evaluation suite for Filipino

📅 2024-09-20
🏛️ Pacific Asia Conference on Language, Information and Computation
📈 Citations: 3
Influential: 0
🤖 AI Summary
Problem: Existing multilingual large language models (LLMs) exhibit significant cultural misalignment in the Philippine context. Method: We introduce the first culturally grounded LLM evaluation suite for the Philippines, co-designed by native speakers and comprising 150 handcrafted prompts embedding local values, customs, and commonsense knowledge. We propose a “cultural alignment–first” evaluation paradigm, which prioritizes culturally appropriate reasoning over mere linguistic correctness, and establish a human baseline (89.10%) from Filipino participants. Evaluation employs manual prompt engineering, explicit cultural knowledge modeling, cross-model comparative benchmarking, and human consistency validation across state-of-the-art multilingual and Filipino-language LLMs. Contribution/Results: The best-performing model achieves only 46.0% accuracy, substantially below the human baseline, demonstrating the suite’s efficacy in exposing cultural representation gaps. This work provides a reproducible methodology and open benchmark resource for cross-cultural AI assessment.

📝 Abstract
Multilingual large language models (LLMs) today may not necessarily provide culturally appropriate and relevant responses to their Filipino users. We introduce Kalahi, a cultural LLM evaluation suite collaboratively created by native Filipino speakers. It is composed of 150 high-quality, handcrafted, and nuanced prompts that test LLMs for generations that are relevant to shared Filipino cultural knowledge and values. Strong LLM performance in Kalahi indicates a model's ability to generate responses similar to what an average Filipino would say or do in a given situation. We conducted experiments on LLMs with multilingual and Filipino language support. Results show that Kalahi, while trivial for Filipinos, is challenging for LLMs, with the best model answering only 46.0% of the questions correctly compared to native Filipino performance of 89.10%. Thus, Kalahi can be used to accurately and reliably evaluate Filipino cultural representation in LLMs.
Problem

Research questions and friction points this paper is trying to address.

Evaluating cultural relevance of multilingual LLMs for Filipinos
Assessing LLM responses against Filipino cultural knowledge
Measuring Filipino cultural representation accuracy in LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Handcrafted Filipino cultural evaluation suite
150 nuanced prompts testing cultural relevance
Evaluates LLM responses against native Filipino standards
Authors
Jann Railey Montalan
AI Singapore, National University of Singapore
Jian Gang Ngui
AI Singapore, National University of Singapore
Wei Qi Leong
AI Singapore, National University of Singapore
Yosephine Susanto
AI Singapore, National University of Singapore
Hamsawardhini Rengarajan
AI Singapore, National University of Singapore
Alham Fikri Aji
MBZUAI, Monash Indonesia
Multilinguality, Low-resource NLP, Language Modeling, Machine Translation
William Chandra Tjhi
AI Singapore, National University of Singapore