Is It Navajo? Accurate Language Detection in Endangered Athabaskan Languages

📅 2025-01-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Endangered Athabaskan languages such as Navajo are severely underrepresented in modern language technologies, which leads to poor identification accuracy and limits digital preservation. Method: We propose a lightweight random forest classifier over character-level n-gram features, designed for low-resource language identification. Contribution/Results: The classifier identifies Navajo against eight frequently confused languages with 97–100% accuracy (the first such result for this language family) and remains robust on other Athabaskan languages. Using manually verified corpora and cross-lingual benchmarking, the model significantly outperforms Google's LLM-based language identification system, which consistently misidentifies Navajo, despite extreme data scarcity. It keeps stable performance with minimal computational overhead, offering a scalable, reproducible, and resource-efficient approach to automated identification and digital archiving of endangered Indigenous languages.
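The character-level n-gram features mentioned in the summary can be illustrated with a minimal sketch. The function name `char_ngrams`, the n-gram range, and the normalization steps (NFC, lowercasing, whitespace collapsing) are assumptions made for illustration; the paper's actual preprocessing is not described here.

```python
import unicodedata
from collections import Counter


def char_ngrams(text: str, n_min: int = 1, n_max: int = 3) -> Counter:
    """Count overlapping character n-grams of length n_min..n_max.

    Hypothetical helper: the normalization below is an assumption,
    not the preprocessing used in the paper.
    """
    # NFC keeps accented letters (common in Navajo orthography) as
    # single code points, so they are never split across n-grams.
    normalized = unicodedata.normalize("NFC", text).lower()
    # Collapse runs of whitespace into single spaces.
    normalized = " ".join(normalized.split())

    counts: Counter = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(normalized) - n + 1):
            counts[normalized[i:i + n]] += 1
    return counts


# "Diné bizaad" is the Navajo name for the Navajo language.
features = char_ngrams("Diné bizaad", n_min=2, n_max=2)
```

Counts like these would typically be turned into a fixed-length vector (one column per n-gram seen in training) before being passed to a tree-based classifier such as a random forest.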

📝 Abstract
Endangered languages such as Navajo, the most widely spoken Native American language, are significantly underrepresented in contemporary language technologies, which exacerbates the challenges of their preservation and revitalization. This study evaluates Google's large language model (LLM)-based language identification system, which consistently misidentifies Navajo, exposing inherent limitations when applied to low-resource Native American languages. To address this, we introduce a random forest classifier trained on Navajo and eight frequently confused languages. Despite its simplicity, the classifier achieves near-perfect accuracy (97–100%), significantly outperforming Google's LLM-based system. Additionally, the model demonstrates robustness across other Athabaskan languages (a family of Native American languages spoken primarily in Alaska, the Pacific Northwest, and parts of the Southwestern United States), suggesting its potential for broader application. Our findings underscore the pressing need for NLP systems that prioritize linguistic diversity and adaptability over centralized, one-size-fits-all solutions, especially in supporting underrepresented languages in a multicultural world. This work directly contributes to ongoing efforts to address cultural biases in language models and advocates for the development of culturally localized NLP tools that serve diverse linguistic communities.
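One way to read an accuracy range such as 97–100% is as per-language accuracy, that is, the share of each language's test sentences that receive the correct label. The sketch below computes that quantity under this assumed reading; the function name `per_language_accuracy` and the toy labels are hypothetical and are not data from the paper.

```python
from collections import defaultdict


def per_language_accuracy(gold, predicted):
    """Return {language: fraction of its examples labeled correctly}.

    Hypothetical helper for illustration; the paper's evaluation
    protocol may differ.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    correct = defaultdict(int)
    total = defaultdict(int)
    for gold_label, predicted_label in zip(gold, predicted):
        total[gold_label] += 1
        if gold_label == predicted_label:
            correct[gold_label] += 1
    return {lang: correct[lang] / total[lang] for lang in total}


# Toy labels only: "nav" is the ISO 639-3 code for Navajo, and
# "other" stands in for any of the confusable languages.
gold = ["nav", "nav", "nav", "nav", "other", "other"]
predicted = ["nav", "nav", "nav", "other", "other", "other"]
scores = per_language_accuracy(gold, predicted)
```

Reporting the score per language, rather than one pooled number, makes it visible when a system does well overall but consistently mislabels a single low-resource language.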
Problem

Research questions and friction points this paper is trying to address.

Endangered Languages
Navajo Language
Recognition Accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Random Forest Classifier
Indigenous Language Identification
High Accuracy