🤖 AI Summary
This study addresses the low-resource and cultural-misalignment challenges that large language models (LLMs) face when translating traditional recipes from ten endangered Indigenous languages of Eastern India into English. We construct the first multimodal dataset of 1,060 authentic recipes, crowdsourced from low-digital-literacy communities to preserve both culinary practices and their sociocultural context. Methodologically, we combine endangered-language preservation with the transmission of Indigenous food culture and propose a culturally sensitive translation evaluation framework. We design a lightweight mobile interface for data collection and improve LLM translation through context augmentation, including culturally grounded prompts and in-context examples. Experimental results show that mainstream LLMs translate substantially worse without contextual support, whereas injecting language- and culture-specific context yields significant improvements in both BLEU scores and human evaluation metrics. These findings provide empirical evidence that cultural adaptation plays a critical role in translation performance for low-resource languages.
📝 Abstract
We present a culturally grounded multimodal dataset of 1,060 traditional recipes crowdsourced from rural communities across remote regions of Eastern India, spanning 10 endangered languages. These recipes, rich in linguistic and cultural nuance, were collected using a mobile interface designed for contributors with low digital literacy. The dataset, Endangered Language Recipes (ELR)-1000, captures not only culinary practices but also the socio-cultural context embedded in Indigenous food traditions. We evaluate several state-of-the-art large language models (LLMs) on translating these recipes into English and find that, despite their general capabilities, the models struggle with low-resource, culturally specific language. However, providing targeted context (background information about the languages, translation examples, and guidelines for cultural preservation) leads to significant improvements in translation quality. Our results underscore the need for benchmarks that cater to underrepresented languages and domains in order to advance equitable and culturally aware language technologies. As part of this work, we release the ELR-1000 dataset to the NLP community in the hope that it motivates the development of language technologies for endangered languages.
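The targeted context described above (language background, translation examples, and cultural-preservation guidelines) could be assembled into a single prompt along the following lines. This is a minimal sketch, not the authors' actual prompt format: the names `TranslationExample` and `build_prompt`, the section headings, and the ordering of the sections are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TranslationExample:
    """One in-context example (hypothetical structure, not from the paper)."""
    source: str   # recipe text in the endangered language
    english: str  # reference English translation


def build_prompt(recipe, language, background, examples, guidelines):
    """Assemble a context-augmented translation prompt as plain text.

    The layout (background, then guidelines, then examples, then the
    recipe) is an assumed ordering chosen for readability.
    """
    parts = [
        f"Translate the following {language} recipe into English.",
        "",
        "Background on the language:",
        background.strip(),
        "",
        "Guidelines for cultural preservation:",
    ]
    # One bullet per guideline, e.g. keeping dish names untranslated.
    parts += [f"- {g}" for g in guidelines]
    parts += ["", "Translation examples:"]
    # Each example is shown as a source/target pair.
    for ex in examples:
        parts += [f"{language}: {ex.source}", f"English: {ex.english}", ""]
    # End with an open "English:" slot for the model to complete.
    parts += ["Recipe to translate:", recipe.strip(), "", "English:"]
    return "\n".join(parts)
```

Calling `build_prompt` with an empty `examples` list and an empty `guidelines` list approximates a low-context baseline prompt, which makes it straightforward to compare translations with and without the added context.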