🤖 AI Summary
Endangered writing systems, such as Nüshu, a rare script historically used by Yao women in China, suffer from extreme data scarcity and prohibitively high manual reconstruction costs. Method: This study proposes NushuRescue, a large-language-model-driven framework for training models on endangered languages with minimal data. We construct NCGold, a 500-sentence Nüshu-Chinese parallel corpus and the first publicly available dataset of its kind, and adapt GPT-4-Turbo, which had no prior exposure to Nüshu, using only 35 short examples from NCGold. The framework automates evaluation and expands the target corpus by generating NCSilver, a set of newly translated sentences. Separately, we develop FastText-based and Seq2Seq models to support further research on Nüshu. Contribution/Results: On 50 withheld sentences, our approach achieves 48.69% translation accuracy, and it produces NCSilver, a set of 98 newly translated modern Chinese sentences of varying lengths. A sample of both corpora is provided in the Supplementary Materials. The framework reduces reliance on expert annotation and offers a reusable methodology for the revitalization of endangered languages.
📝 Abstract
The preservation and revitalization of endangered and extinct languages is a meaningful endeavor, conserving cultural heritage while enriching fields like linguistics and anthropology. However, these languages are typically low-resource, making their reconstruction labor-intensive and costly. This challenge is exemplified by Nushu, a rare script historically used by Yao women in China for self-expression within a patriarchal society. To address this challenge, we introduce NushuRescue, an AI-driven framework designed to train large language models (LLMs) on endangered languages with minimal data. NushuRescue automates evaluation and expands target corpora to accelerate linguistic revitalization. As a foundational component, we developed NCGold, a 500-sentence Nushu-Chinese parallel corpus, the first publicly available dataset of its kind. Leveraging GPT-4-Turbo, with no prior exposure to Nushu and only 35 short examples from NCGold, NushuRescue achieved 48.69% translation accuracy on 50 withheld sentences and generated NCSilver, a set of 98 newly translated modern Chinese sentences of varying lengths. A sample of both NCGold and NCSilver is included in the Supplementary Materials. Additionally, we developed FastText-based and Seq2Seq models to further support research on Nushu. NushuRescue provides a versatile and scalable tool for the revitalization of endangered languages, minimizing the need for extensive human input.
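The few-shot setup described above (a handful of parallel examples, then scoring on withheld sentences) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt wording, the function names, and the position-wise character accuracy metric are all hypothetical, since the abstract does not specify how the examples are presented to the model or how "translation accuracy" is computed. The model call itself is omitted, and the Nushu strings are placeholders.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], source: str) -> str:
    """Assemble a translation prompt from (Nushu, Chinese) example pairs.

    The instruction text and layout are assumptions made for illustration.
    """
    blocks = ["Translate the Nushu sentence into modern Chinese."]
    for nushu, chinese in examples:
        blocks.append(f"Nushu: {nushu}\nChinese: {chinese}")
    # The final block leaves the Chinese side empty for the model to complete.
    blocks.append(f"Nushu: {source}\nChinese:")
    return "\n\n".join(blocks)


def char_accuracy(prediction: str, reference: str) -> float:
    """Fraction of reference characters matched position by position.

    An assumed stand-in metric; the paper's own definition may differ.
    """
    if not reference:
        return 0.0
    matches = sum(p == r for p, r in zip(prediction, reference))
    return matches / len(reference)


def mean_accuracy(predictions: List[str], references: List[str]) -> float:
    """Average the per-sentence accuracy over a withheld test set."""
    scores = [char_accuracy(p, r) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Placeholder pairs; real use would draw the examples from NCGold.
    few_shot = [("<nushu-1>", "你好"), ("<nushu-2>", "再见")]
    print(build_few_shot_prompt(few_shot, "<nushu-3>"))
    print(mean_accuracy(["你好", "再见"], ["你好", "再会"]))
```

In this sketch the 35 NCGold examples would fill `few_shot`, and `mean_accuracy` would be run over model outputs for the 50 withheld sentences.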