🤖 AI Summary
This work addresses the performance gap in natural language processing for low-resource languages caused by the scarcity of annotated data. The authors propose GETR (Graph-Enhanced Token Representation), a method that uses graph neural networks to model cross-lingual semantic structure, enabling knowledge transfer from high-resource to low-resource languages with only a few hundred labeled examples. On truly low-resource languages (Mizo, Khasi), GETR improves POS tagging by 13 macro-F1 points over baselines based on hidden-layer augmentation and word-translation embeddings. In simulated low-resource settings (Marathi, Bangla, Malayalam), it achieves gains of 20 and 27 macro-F1 points on sentiment classification and named entity recognition, respectively. A detailed analysis identifies the key factors underlying successful cross-lingual transfer.
📝 Abstract
Despite remarkable advances in natural language processing, developing effective systems for low-resource languages remains a formidable challenge: performance typically lags far behind that of high-resource counterparts due to data scarcity and insufficient linguistic resources. Cross-lingual knowledge transfer has emerged as a promising way to address this challenge by leveraging resources from high-resource languages. In this paper, we investigate methods for transferring linguistic knowledge from high-resource languages to low-resource languages in which the number of labeled training instances is in the hundreds, focusing on both sentence-level and word-level tasks. We introduce a novel method, GETR (Graph-Enhanced Token Representation), for cross-lingual knowledge transfer, along with two adapted baselines: (a) augmentation in hidden layers and (b) token embedding transfer through token translation. Experimental results demonstrate that our GNN-based approach significantly outperforms existing multilingual and cross-lingual baselines, achieving a 13 percentage point improvement on truly low-resource languages (Mizo, Khasi) for POS tagging, and 20 and 27 percentage point improvements in macro-F1 on simulated low-resource languages (Marathi, Bangla, Malayalam) for sentiment classification and NER, respectively. We also present a detailed analysis of the transfer mechanisms and identify key factors that contribute to successful knowledge transfer in this linguistic context.
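The abstract does not detail GETR's architecture, but the general idea of graph-enhanced token representations can be illustrated with a minimal sketch: low-resource tokens are linked to high-resource tokens (for example, via a translation lexicon), and a message-passing step mixes neighbor embeddings into each token's vector so that knowledge from the high-resource side reaches unseen low-resource tokens. The lexicon entries, token names, and mean aggregation below are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of graph-based cross-lingual token representation.
# This is NOT the paper's GETR implementation; it only illustrates
# mean-aggregation message passing over a token graph whose edges
# come from a hypothetical bilingual lexicon.

def gnn_layer(embeddings, edges):
    """One mean-aggregation message-passing step.

    embeddings: dict mapping token -> embedding (list of floats)
    edges: dict mapping token -> list of neighbor tokens
    Returns a new dict where each token's vector is the mean of
    its own embedding and its neighbors' embeddings.
    """
    updated = {}
    for tok, vec in embeddings.items():
        neighbors = edges.get(tok, [])
        vecs = [vec] + [embeddings[n] for n in neighbors if n in embeddings]
        dim = len(vec)
        updated[tok] = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    return updated

# Toy example: one low-resource token linked to two high-resource
# tokens through a hypothetical lexicon (edges are assumptions).
emb = {
    "mizo:lehkhabu": [0.0, 0.0],  # low-resource token, no pretrained signal
    "en:book":       [1.0, 0.0],  # high-resource embeddings
    "en:notebook":   [1.0, 1.0],
}
links = {"mizo:lehkhabu": ["en:book", "en:notebook"]}

out = gnn_layer(emb, links)
# The low-resource token now carries averaged high-resource signal;
# tokens without edges keep their original embedding.
```

In a real GNN this aggregation would use learned weight matrices and nonlinearities and be stacked for several layers, but the core mechanism, propagating embeddings along cross-lingual edges, is the same.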