🤖 AI Summary
This work addresses the challenge of automating interlinear glossed text (IGT) annotation for morphologically rich but resource-scarce languages, evaluated here on Jungar Tuvan, a low-resource Turkic language. The authors propose a two-stage hybrid pipeline that combines neural sequence labeling with large language model (LLM)-based post-correction: a BiLSTM-CRF model produces an initial morphological annotation, which an LLM then refines under a retrieval-augmented few-shot prompting strategy. Empirical results show that retrieval-based example selection significantly outperforms random sampling, while incorporating lexical resources typically degrades performance. The proposed approach achieves substantial gains in annotation accuracy while remaining lightweight and efficient, markedly reducing the manual effort required for linguistic documentation and offering a scalable paradigm for processing low-resource languages.
📄 Abstract
Interlinear glossed text (IGT) creation remains a major bottleneck in linguistic documentation and fieldwork, particularly for low-resource, morphologically rich languages. We present a hybrid automatic glossing pipeline that combines neural sequence labeling with large language model (LLM) post-correction, evaluated on Jungar Tuvan, a low-resource Turkic language. Through systematic ablation studies, we show that retrieval-augmented prompting provides substantial gains over random example selection. We further find that, in most cases, morpheme dictionaries paradoxically hurt performance compared to providing no dictionary at all, and that performance scales approximately logarithmically with the number of few-shot examples. Most significantly, our two-stage pipeline combining a BiLSTM-CRF model with LLM post-correction yields substantial gains for most models, achieving meaningful reductions in annotation workload. Drawing on these findings, we establish concrete design principles for integrating structured prediction models with LLM reasoning in morphologically complex fieldwork contexts. These principles demonstrate that hybrid architectures offer a promising direction for computationally light solutions to automatic linguistic annotation in endangered language documentation.
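To make the retrieval-augmented prompting stage concrete, here is a minimal sketch of how such a post-correction prompt might be assembled: given a sentence and the draft gloss from the sequence labeler, retrieve the most similar glossed examples from a small training pool and present them as few-shot demonstrations. All function names and the similarity metric (character-bigram Jaccard overlap) are illustrative assumptions, not the paper's exact retrieval method.

```python
# Hypothetical sketch of retrieval-augmented few-shot prompt construction
# for LLM gloss post-correction. The similarity metric and prompt format
# are assumptions for illustration only.

def bigrams(text):
    """Character bigrams: a cheap proxy for surface-form similarity."""
    return {text[i:i + 2] for i in range(len(text) - 1)}

def similarity(a, b):
    """Jaccard overlap of character bigrams between two sentences."""
    ba, bb = bigrams(a), bigrams(b)
    return len(ba & bb) / len(ba | bb) if ba | bb else 0.0

def retrieve_examples(sentence, pool, k=3):
    """Pick the k glossed training examples most similar to the input."""
    ranked = sorted(pool, key=lambda ex: similarity(sentence, ex["text"]),
                    reverse=True)
    return ranked[:k]

def build_prompt(sentence, draft_gloss, pool, k=3):
    """Assemble a few-shot correction prompt around the BiLSTM-CRF draft."""
    parts = ["Correct the draft interlinear gloss. Examples:"]
    for ex in retrieve_examples(sentence, pool, k):
        parts.append(f"Text: {ex['text']}\nGloss: {ex['gloss']}")
    parts.append(f"Text: {sentence}\nDraft gloss: {draft_gloss}\n"
                 "Corrected gloss:")
    return "\n\n".join(parts)
```

The point of retrieval (versus random sampling, which the ablations show is weaker) is that the demonstrations share surface and morphological material with the input, giving the LLM directly reusable gloss patterns.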