SEADialogues: A Multilingual Culturally Grounded Multi-turn Dialogue Dataset on Southeast Asian Languages

📅 2025-08-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing chit-chat datasets largely overlook local cultural contexts and linguistic diversity in multicultural regions such as Southeast Asia, which hinders culturally grounded dialogue modeling. To address this, we propose the first culture-aware multi-turn dialogue construction framework designed specifically for Southeast Asia. It covers six countries and eight languages, including several low-resource ones, and combines fine-grained persona annotation with locally grounded everyday-life topic prompting to ensure cultural representativeness and linguistic authenticity. We release SEADialogues, a large-scale multilingual dialogue dataset that unifies three key dimensions: cultural context, multi-turn interaction structure, and persona-based modeling. Empirical evaluation demonstrates substantial improvements in how models understand and generate dialogue in regional cultural contexts. SEADialogues provides infrastructure for low-resource multilingual dialogue systems and for research on culturally adaptive large language models.

📝 Abstract

Although numerous datasets have been developed to support dialogue systems, most existing chit-chat datasets overlook the cultural nuances inherent in natural human conversations. To address this gap, we introduce SEADialogues, a culturally grounded dialogue dataset centered on Southeast Asia, a region with over 700 million people and immense cultural diversity. Our dataset features dialogues in eight languages from six Southeast Asian countries, many of which are low-resource despite having sizable speaker populations. To enhance cultural relevance and personalization, each dialogue includes persona attributes and two culturally grounded topics that reflect everyday life in the respective communities. Furthermore, we release a multi-turn dialogue dataset to advance research on culturally aware and human-centric large language models, including conversational dialogue agents.
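
The abstract describes each dialogue as carrying persona attributes and two culturally grounded topics alongside its turns. A minimal sketch of how one such record might be represented is shown here. The class names, field names, and placeholder values are all hypothetical and chosen for illustration only; they are not the dataset's published schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Persona:
    # Hypothetical container for a speaker's persona attributes.
    speaker: str
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class Turn:
    speaker: str
    text: str


@dataclass
class Dialogue:
    # All field names are assumptions, not the released format.
    language: str
    country: str
    personas: List[Persona]
    topics: List[str]
    turns: List[Turn] = field(default_factory=list)

    def validate(self) -> None:
        """Check the two properties stated in the abstract."""
        if len(self.topics) != 2:
            raise ValueError("expected exactly two topics per dialogue")
        if len(self.turns) < 2:
            raise ValueError("a multi-turn dialogue needs at least two turns")


# Placeholder values only; no real dataset content is reproduced here.
example = Dialogue(
    language="<language>",
    country="<country>",
    personas=[Persona("A", {"<attribute>": "<value>"}), Persona("B")],
    topics=["<topic 1>", "<topic 2>"],
    turns=[Turn("A", "<utterance 1>"), Turn("B", "<utterance 2>")],
)
example.validate()  # raises ValueError if the record is malformed
```

Keeping the two-topic and multi-turn checks in a single `validate` method makes it easy to filter malformed records before training or evaluation.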

Problem

Research questions and friction points this paper is trying to address.

Addressing cultural gaps in multilingual dialogue datasets

Providing culturally grounded dialogues for Southeast Asian languages

Supporting research on human-centric culturally aware language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual dialogue dataset for Southeast Asia

Culturally grounded topics and personas

Supports human-centric large language models

🔎 Similar Papers

SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages