NushuRescue: Revitalization of the Endangered Nushu Language with AI

📅 2024-11-29
🤖 AI Summary
Endangered writing systems, such as the Nüshu script historically used by Yao women in China, suffer from extreme data scarcity and prohibitively high manual reconstruction costs. Method: This study proposes NushuRescue, a large-language-model-driven framework for low-resource language reconstruction. We construct NCGold, a 500-sentence Nüshu-Chinese parallel corpus and the first publicly available dataset of its kind, and adapt GPT-4-Turbo, which had no prior exposure to Nüshu, using only 35 short examples from NCGold. The framework automates evaluation and expands the corpus by generating NCSilver, a set of 98 newly translated modern Chinese sentences of varying lengths. FastText-based and Seq2Seq models are developed separately to support further research on Nüshu. Contribution/Results: On a test set of 50 withheld sentences, the approach achieves 48.69% translation accuracy. Samples of NCGold and NCSilver are provided in the Supplementary Materials. The framework aims to reduce reliance on expert annotation and to offer a reusable methodology for the revitalization of endangered languages.

📝 Abstract
The preservation and revitalization of endangered and extinct languages is a meaningful endeavor, conserving cultural heritage while enriching fields like linguistics and anthropology. However, these languages are typically low-resource, making their reconstruction labor-intensive and costly. This challenge is exemplified by Nushu, a rare script historically used by Yao women in China for self-expression within a patriarchal society. To address this challenge, we introduce NushuRescue, an AI-driven framework designed to train large language models (LLMs) on endangered languages with minimal data. NushuRescue automates evaluation and expands target corpora to accelerate linguistic revitalization. As a foundational component, we developed NCGold, a 500-sentence Nushu-Chinese parallel corpus, the first publicly available dataset of its kind. Leveraging GPT-4-Turbo, with no prior exposure to Nushu and only 35 short examples from NCGold, NushuRescue achieved 48.69% translation accuracy on 50 withheld sentences and generated NCSilver, a set of 98 newly translated modern Chinese sentences of varying lengths. A sample of both NCGold and NCSilver is included in the Supplementary Materials. Additionally, we developed FastText-based and Seq2Seq models to further support research on Nushu. NushuRescue provides a versatile and scalable tool for the revitalization of endangered languages, minimizing the need for extensive human input.
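
The workflow described above (a small set of parallel examples shown to a general-purpose LLM, then scoring on withheld sentences) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not specify the prompt format, the translation direction, or how the 48.69% accuracy is computed, so the `Source:`/`Target:` labels, the function names, and the character-overlap score from `difflib` are hypothetical stand-ins rather than the authors' method. The example pairs are placeholders, not real corpus data.

```python
from difflib import SequenceMatcher


def build_few_shot_prompt(examples, source_sentence):
    """Assemble a plain-text few-shot prompt from (source, target) pairs.

    The layout is a hypothetical one; the paper's actual prompt is not
    given in the abstract.
    """
    lines = ["Translate each source sentence into the target language."]
    for source, target in examples:
        lines.append(f"Source: {source}")
        lines.append(f"Target: {target}")
    # The sentence to translate goes last, with the target left open.
    lines.append(f"Source: {source_sentence}")
    lines.append("Target:")
    return "\n".join(lines)


def similarity(prediction, reference):
    """Character-level similarity in [0, 1] (stand-in metric, an assumption)."""
    return SequenceMatcher(None, prediction, reference).ratio()


def mean_score(predictions, references):
    """Average the per-sentence similarity over a withheld test set."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    total = sum(similarity(p, r) for p, r in zip(predictions, references))
    return total / len(predictions)


if __name__ == "__main__":
    # Placeholder pairs standing in for a handful of parallel examples.
    demo_examples = [
        ("<source sentence 1>", "<target sentence 1>"),
        ("<source sentence 2>", "<target sentence 2>"),
    ]
    print(build_few_shot_prompt(demo_examples, "<withheld source sentence>"))
    print(mean_score(["abc", "xyz"], ["abc", "abc"]))
```

In practice the prompt string would be sent to whichever LLM is being evaluated, and the model outputs for the withheld sentences would be passed to `mean_score` together with the reference translations. A stricter metric, such as exact match, could be substituted without changing the rest of the sketch.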

Problem

Research questions and friction points this paper is trying to address.

Artificial Intelligence
Endangered Languages
Cultural Preservation

Innovation

Methods, ideas, or system contributions that make the work stand out.

AI-driven language preservation
NCGold corpus
GPT-4-Turbo translation