ELR-1000: A Community-Generated Dataset for Endangered Indic Indigenous Languages

📅 2025-11-30
🤖 AI Summary
This study addresses the low-resource and cultural-misalignment challenges that large language models (LLMs) face when translating traditional recipes from ten endangered Indigenous languages of Eastern India into English. We construct the first multimodal dataset comprising 1,060 authentic recipes, crowdsourced from low-digital-literacy communities to preserve both culinary practices and their sociocultural context. Methodologically, we integrate endangered-language preservation with indigenous food-culture transmission, proposing a culturally sensitive translation evaluation framework. We design a lightweight mobile interface for data collection and enhance LLM translation via context augmentation techniques, including culturally grounded prompts and in-context examples. Experimental results show that mainstream LLMs exhibit substantially degraded translation quality without contextual support; however, injecting language- and culture-specific context yields significant improvements in both BLEU scores and human evaluation metrics. These findings empirically support the critical role of cultural adaptation in improving translation performance for low-resource languages.

📝 Abstract
We present a culturally grounded multimodal dataset of 1,060 traditional recipes crowdsourced from rural communities across remote regions of Eastern India, spanning 10 endangered languages. These recipes, rich in linguistic and cultural nuance, were collected using a mobile interface designed for contributors with low digital literacy. The dataset, Endangered Language Recipes (ELR)-1000, captures not only culinary practices but also the socio-cultural context embedded in indigenous food traditions. We evaluate several state-of-the-art large language models (LLMs) on translating these recipes into English and find that, despite their general capabilities, they struggle with low-resource, culturally specific language. However, providing targeted context (background information about the languages, translation examples, and guidelines for cultural preservation) leads to significant improvements in translation quality. Our results underscore the need for benchmarks that cater to underrepresented languages and domains to advance equitable and culturally aware language technologies. As part of this work, we release the ELR-1000 dataset to the NLP community, hoping it motivates the development of language technologies for endangered languages.
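The abstract names three kinds of targeted context: language background, translation examples, and cultural-preservation guidelines. As a rough illustration of how such context might be assembled into a translation prompt, here is a minimal Python sketch. It is not the authors' prompt or pipeline; `TranslationContext`, `build_prompt`, and all prompt wording are hypothetical names and text chosen for this example, and the call to an actual LLM is deliberately left out.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TranslationContext:
    """Hypothetical container for the three context types named in the abstract."""
    language_background: str = ""
    examples: list[tuple[str, str]] = field(default_factory=list)  # (source, English)
    guidelines: list[str] = field(default_factory=list)


def build_prompt(recipe_text: str, language: str,
                 ctx: TranslationContext | None = None) -> str:
    """Assemble a translation prompt; with ctx=None this is the no-context baseline."""
    parts = [f"Translate the following {language} recipe into English."]
    if ctx is not None:
        if ctx.language_background:
            parts.append("Background on the language:\n" + ctx.language_background)
        if ctx.examples:
            shots = "\n".join(f"Source: {src}\nEnglish: {eng}"
                              for src, eng in ctx.examples)
            parts.append("Example translations:\n" + shots)
        if ctx.guidelines:
            rules = "\n".join(f"- {g}" for g in ctx.guidelines)
            parts.append("Guidelines for cultural preservation:\n" + rules)
    # The recipe always comes last so the model sees the context first.
    parts.append("Recipe:\n" + recipe_text)
    return "\n\n".join(parts)


if __name__ == "__main__":
    # Placeholder values only; real background text and examples would come
    # from the dataset and from speakers of the language.
    ctx = TranslationContext(
        language_background="<short description of the language>",
        examples=[("<source sentence>", "<English translation>")],
        guidelines=["Keep names of dishes and ingredients in the original language."],
    )
    print(build_prompt("<recipe text>", "<language name>", ctx))
```

Comparing the output of `build_prompt(text, lang)` with `build_prompt(text, lang, ctx)` on the same recipe gives a baseline and a context-augmented prompt, which is the kind of contrast the abstract describes.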
Problem

Research questions and friction points this paper is trying to address.

Creates a dataset of endangered language recipes
Evaluates LLMs on translating low-resource cultural content
Proposes benchmarks for equitable language technologies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mobile interface for low digital literacy contributors
Targeted context improves translation of endangered languages
Multimodal dataset captures socio-cultural culinary practices
Neha Joshi
Karya
Pamir Gogoi
PhD, University of Florida
Linguistics, Phonetics, Voice Quality, Nasality
Aasim Mirza
Karya
Aayush Jansari
Karya
Aditya Yadavalli
UC San Diego
Ayushi Pandey
Trinity College Dublin
text-to-speech evaluation, computational linguistics, acoustic-phonetics, speech perception
Arunima Shukla
Karya
Deepthi Sudharsan
Independent Researcher
Kalika Bali
Microsoft Corporation
Vivek Seshadri
Student, CS, CMU
Computer Architecture