ELR-1000: A Community-Generated Dataset for Endangered Indic Indigenous Languages

📅 2025-11-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the low-resource and cultural-misalignment challenges that large language models (LLMs) face when translating traditional recipes from ten endangered Indigenous languages of Eastern India into English. We construct a multimodal dataset of 1,060 authentic recipes, crowdsourced from communities with low digital literacy, to preserve both culinary practices and their sociocultural context. Methodologically, we integrate endangered-language preservation with the transmission of indigenous food culture and propose a culturally sensitive translation evaluation framework. We design a lightweight mobile interface for data collection and enhance LLM translation through context augmentation, including culturally grounded prompts and in-context examples. Experimental results show that mainstream LLMs translate substantially worse without contextual support, whereas injecting language- and culture-specific context yields significant improvements in both BLEU scores and human evaluation metrics. These findings support the critical role of cultural adaptation in improving translation performance for low-resource languages.

📝 Abstract

We present a culturally grounded multimodal dataset of 1,060 traditional recipes crowdsourced from rural communities across remote regions of Eastern India, spanning 10 endangered languages. These recipes, rich in linguistic and cultural nuance, were collected using a mobile interface designed for contributors with low digital literacy. The dataset, Endangered Language Recipes (ELR)-1000, captures not only culinary practices but also the socio-cultural context embedded in indigenous food traditions. We evaluate several state-of-the-art large language models (LLMs) on translating these recipes into English and find that, despite their general capabilities, they struggle with low-resource, culturally specific language. However, providing targeted context (background information about the languages, translation examples, and guidelines for cultural preservation) leads to significant improvements in translation quality. Our results underscore the need for benchmarks that cater to underrepresented languages and domains in order to advance equitable and culturally aware language technologies. As part of this work, we release the ELR-1000 dataset to the NLP community in the hope that it motivates the development of language technologies for endangered languages.
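
The context-augmentation idea described in the abstract can be illustrated with a minimal sketch. It assumes a plain-text prompt assembled from the three kinds of context the abstract names (language background, translation examples, and cultural-preservation guidelines). The names `TranslationContext` and `build_prompt`, the prompt wording, and the section order are hypothetical and are not taken from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TranslationContext:
    """Hypothetical container for the three kinds of context named in the abstract."""
    language_background: str = ""
    examples: list[tuple[str, str]] = field(default_factory=list)  # (source, English) pairs
    guidelines: list[str] = field(default_factory=list)


def build_prompt(recipe_text: str, language: str,
                 context: TranslationContext | None = None) -> str:
    """Assemble a translation prompt; with context=None this is the no-context baseline."""
    parts = [f"Translate the following {language} recipe into English."]
    if context is not None:
        if context.language_background:
            parts.append("Background on the language:\n" + context.language_background)
        if context.guidelines:
            bullet_list = "\n".join(f"- {g}" for g in context.guidelines)
            parts.append("Guidelines:\n" + bullet_list)
        for source, english in context.examples:
            parts.append(f"Example source:\n{source}\nExample translation:\n{english}")
    # The recipe itself always comes last, after any context sections.
    parts.append(f"Recipe:\n{recipe_text}")
    return "\n\n".join(parts)


if __name__ == "__main__":
    # Placeholder strings only; no real dataset content is reproduced here.
    ctx = TranslationContext(
        language_background="<short description of the language>",
        examples=[("<source sentence>", "<English translation>")],
        guidelines=["Keep dish and ingredient names in the original language."],
    )
    print(build_prompt("<recipe text>", "<language name>", ctx))
```

Calling `build_prompt` with and without a `TranslationContext` produces the two conditions the abstract contrasts, so the same recipe can be sent to a model under both and the outputs compared.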

Problem

Research questions and friction points this paper is trying to address.

Creates a dataset of endangered language recipes

Evaluates LLMs on translating low-resource cultural content

Highlights the need for benchmarks covering underrepresented languages and domains

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mobile interface for low digital literacy contributors

Targeted context improves translation of endangered languages

Multimodal dataset captures socio-cultural culinary practices
