Bilingual Word Level Language Identification for Omotic Languages

📅 2025-09-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses word-level bilingual language identification (BLID) between Wolaita and Gofa, two low-resource, lexically similar Omotic languages spoken in southern Ethiopia. We propose a BERT-LSTM hybrid model: multilingual BERT serves as the contextualized word embedding backbone, while a bidirectional LSTM layer captures morphological and sequential contextual features; training is further enhanced via lexical data augmentation and class-balancing strategies. Evaluated on the first manually curated Wolaita–Gofa word-level bilingual test set, our model achieves an F1 score of 0.72, outperforming the other approaches tested. This is the first systematic effort to tackle fine-grained language identification for these endangered languages. The proposed methodology offers a reusable technical framework for low-resource Omotic NLP tasks, with implications for social media content moderation and digital language resource development.

📝 Abstract
Language identification is the task of determining the language or languages of a given text. In many real-world scenarios, a text may contain more than one language, particularly in multilingual communities. Bilingual Language Identification (BLID) is the task of identifying and distinguishing between two languages in a given text. This paper presents BLID for two languages spoken in the southern part of Ethiopia, namely Wolaita and Gofa. The two languages share many similar words while differing in others, which makes the language identification task challenging. To overcome this challenge, we ran experiments with a range of approaches. Among them, the combination of a BERT-based pretrained language model with an LSTM performed best, with an F1 score of 0.72 on the test set. The work can help tackle unwanted social media issues and provides a foundation for further research in this area.
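
To make the reported metric concrete, the following is a minimal sketch of how an F1 score can be computed over word-level language tags. The abstract does not say which averaging the authors used, so macro-averaging is an assumption here, and the tag names and the six-word example are hypothetical placeholders rather than data from the paper.

```python
def per_class_f1(gold, pred, label):
    """F1 for a single label over aligned word-level gold and predicted tags."""
    pairs = list(zip(gold, pred))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 (assumed averaging, not confirmed by the source)."""
    labels = sorted(set(gold) | set(pred))
    return sum(per_class_f1(gold, pred, label) for label in labels) / len(labels)


# Hypothetical tags for a six-word sentence (placeholders, not real data).
gold = ["wolaita", "wolaita", "gofa", "gofa", "wolaita", "gofa"]
pred = ["wolaita", "gofa", "gofa", "gofa", "wolaita", "wolaita"]

score = macro_f1(gold, pred)
print(round(score, 3))
```

Each class here has two true positives, one false positive, and one false negative, so precision, recall, and F1 are all 2/3 for both classes.
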
Problem

Research questions and friction points this paper is trying to address.

Identifies the language of each word in bilingual Wolaita and Gofa text
Addresses the word similarities that make the two Omotic languages hard to tell apart
Develops a model intended to support social media content classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid model combining a BERT-based pretrained language model with an LSTM
Word-level bilingual language identification for the Wolaita-Gofa pair
Handles words that are similar across the two languages
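
The BERT and LSTM hybrid named above can be sketched roughly as follows, using PyTorch and Hugging Face Transformers. This is an illustration under stated assumptions, not the authors' implementation: the class name `BertLstmTagger`, the layer sizes, the bidirectional default, and the two-label output are all hypothetical, and a tiny randomly initialised BERT stands in for a pretrained checkpoint so the sketch runs offline.

```python
import torch
from torch import nn
from transformers import BertConfig, BertModel


class BertLstmTagger(nn.Module):
    """BERT encoder -> LSTM -> per-token language logits (hypothetical sketch)."""

    def __init__(self, bert, lstm_hidden=64, num_labels=2, bidirectional=True):
        super().__init__()
        self.bert = bert
        self.lstm = nn.LSTM(
            input_size=bert.config.hidden_size,
            hidden_size=lstm_hidden,
            batch_first=True,
            bidirectional=bidirectional,
        )
        out_dim = lstm_hidden * (2 if bidirectional else 1)
        self.classifier = nn.Linear(out_dim, num_labels)

    def forward(self, input_ids, attention_mask):
        # Contextual embeddings for every subword token: (batch, seq_len, hidden_size).
        hidden = self.bert(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # The LSTM reads the embedded sequence; padding is not masked in this sketch.
        lstm_out, _ = self.lstm(hidden)
        # One score per label for each token: (batch, seq_len, num_labels).
        return self.classifier(lstm_out)


# Assumption: a tiny random BERT replaces a pretrained model to avoid any download.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
)
model = BertLstmTagger(BertModel(config))
model.eval()

input_ids = torch.randint(0, 100, (2, 7))  # two dummy sequences of seven token ids
attention_mask = torch.ones_like(input_ids)
with torch.no_grad():
    logits = model(input_ids, attention_mask)
print(logits.shape)
```

In practice the random encoder would be replaced by a pretrained one, for example `BertModel.from_pretrained("bert-base-multilingual-cased")`; the abstract does not name the checkpoint the authors used. Because BERT tokenizes words into subwords, word-level tags would also need an alignment step, such as taking the prediction at each word's first subword, which is likewise an assumption here.
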