IndoPref: A Multi-Domain Pairwise Preference Dataset for Indonesian

📅 2025-07-29

🤖 AI Summary

Indonesian, spoken by over 200 million people, is severely underrepresented in preference-based large language model (LLM) research. Existing multilingual preference datasets are predominantly machine-translated from English, which compromises linguistic authenticity and cultural relevance. To address this gap, the paper introduces IndoPref, the first fully human-authored, multi-domain, natively Indonesian pairwise preference dataset. IndoPref spans six domains, including daily dialogue, news, and literature, with all samples written from scratch by native speakers. Rigorous double-blind annotation yields high inter-annotator agreement (Krippendorff’s alpha = 0.87). Using IndoPref, the authors systematically evaluate preference modeling across 12 state-of-the-art LLMs. The work establishes a culturally grounded, reproducible benchmark for Indonesian preference learning and supports alignment research for under-resourced linguistic communities.

📝 Abstract

Over 200 million people speak Indonesian, yet the language remains significantly underrepresented in preference-based research for large language models (LLMs). Most existing multilingual datasets are derived from English translations, often resulting in content that lacks cultural and linguistic authenticity. To address this gap, we introduce IndoPref, the first fully human-authored and multi-domain Indonesian preference dataset specifically designed to evaluate the naturalness and quality of LLM-generated text. All annotations are natively written in Indonesian and evaluated using Krippendorff's alpha, demonstrating strong inter-annotator agreement. Additionally, we benchmark the dataset across multiple LLMs and assess the output quality of each model.
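Since agreement is reported with Krippendorff's alpha, a minimal sketch of the nominal-data version of that statistic may help make the figure concrete. The function name `krippendorff_alpha_nominal`, the label values `"A"` and `"B"`, and the toy ratings are illustrative assumptions; the paper's actual label set, number of annotators, and distance metric are not stated here.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of items; each item is the list of labels that
    annotators gave it. `None` marks a missing rating. (Hypothetical
    input layout, chosen for illustration.)
    """
    # Coincidence matrix: each ordered pair of ratings inside one item
    # contributes 1 / (m - 1), where m is the number of ratings on it.
    coincidences = Counter()
    for ratings in units:
        values = [r for r in ratings if r is not None]
        m = len(values)
        if m < 2:
            continue  # items with fewer than two ratings carry no information
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and the overall number of pairable ratings.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more ratings")

    # Observed and expected disagreement (nominal metric: 1 if labels differ).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - observed / expected


# Toy example: four response pairs, two annotators each, "A"/"B" = preferred side.
toy = [["A", "A"], ["A", "B"], ["B", "B"], ["B", "B"]]
print(round(krippendorff_alpha_nominal(toy), 3))  # 0.533
```

An alpha of 1 means perfect agreement and 0 means agreement no better than chance, so values near the upper end of that range are what "strong inter-annotator agreement" refers to.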

Problem

Research questions and friction points this paper is trying to address.

Underrepresentation of Indonesian in LLM preference research

Existing multilingual datasets are translated from English and lack Indonesian cultural and linguistic authenticity

Need for native multi-domain Indonesian LLM evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

First human-authored Indonesian preference dataset

Multi-domain naturalness and quality evaluation

Native annotations with strong agreement metrics
