🤖 AI Summary
This work addresses the performance gap in natural language processing for low-resource languages caused by the scarcity of annotated data. The authors propose GETR (Graph-Enhanced Token Representation), a method that uses graph neural networks to model cross-lingual semantic structure, enabling knowledge transfer from high-resource to low-resource languages with only a few hundred labeled examples. On truly low-resource languages (Mizo, Khasi), GETR improves POS tagging by 13 macro-F1 points over baselines based on hidden-layer augmentation and word-translation embeddings. In simulated low-resource settings (Marathi, Bangla, Malayalam), it achieves gains of 20 and 27 macro-F1 points on sentiment classification and named entity recognition, respectively. A detailed analysis identifies the key factors underlying successful cross-lingual transfer.
📝 Abstract
Despite remarkable advances in natural language processing, developing effective systems for low-resource languages remains a formidable challenge: performance typically lags far behind that of high-resource counterparts due to data scarcity and insufficient linguistic resources. Cross-lingual knowledge transfer has emerged as a promising way to address this challenge by leveraging resources from high-resource languages. In this paper, we investigate methods for transferring linguistic knowledge from high-resource languages to low-resource languages in which the number of labeled training instances is in the hundreds, focusing on both sentence-level and word-level tasks. We introduce a novel method, GETR (Graph-Enhanced Token Representation), for cross-lingual knowledge transfer, along with two adapted baselines: (a) augmentation in hidden layers and (b) token embedding transfer through token translation. Experimental results demonstrate that our GNN-based approach significantly outperforms existing multilingual and cross-lingual baselines, achieving a 13 percentage point improvement on truly low-resource languages (Mizo, Khasi) for POS tagging, and 20 and 27 percentage point improvements in macro-F1 on simulated low-resource languages (Marathi, Bangla, Malayalam) for sentiment classification and NER, respectively. We also present a detailed analysis of the transfer mechanisms and identify key factors that contribute to successful knowledge transfer in this linguistic context.
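The abstract does not detail GETR's architecture, but the general idea of graph-enhanced token representations can be illustrated with a minimal sketch: low-resource tokens are linked to high-resource tokens (for example, via a translation lexicon), and a message-passing step mixes neighbor embeddings into each token's vector so that knowledge from the high-resource side reaches unseen low-resource tokens. The lexicon entries, token names, and mean aggregation below are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of graph-based cross-lingual token representation.
# This is NOT the paper's GETR implementation; it only illustrates
# mean-aggregation message passing over a token graph whose edges
# come from a hypothetical bilingual lexicon.

def gnn_layer(embeddings, edges):
    """One mean-aggregation message-passing step.

    embeddings: dict mapping token -> embedding (list of floats)
    edges: dict mapping token -> list of neighbor tokens
    Returns a new dict where each token's vector is the mean of
    its own embedding and its neighbors' embeddings.
    """
    updated = {}
    for tok, vec in embeddings.items():
        neighbors = edges.get(tok, [])
        vecs = [vec] + [embeddings[n] for n in neighbors if n in embeddings]
        dim = len(vec)
        updated[tok] = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    return updated

# Toy example: one low-resource token linked to two high-resource
# tokens through a hypothetical lexicon (edges are assumptions).
emb = {
    "mizo:lehkhabu": [0.0, 0.0],  # low-resource token, no pretrained signal
    "en:book":       [1.0, 0.0],  # high-resource embeddings
    "en:notebook":   [1.0, 1.0],
}
links = {"mizo:lehkhabu": ["en:book", "en:notebook"]}

out = gnn_layer(emb, links)
# The low-resource token now carries averaged high-resource signal;
# tokens without edges keep their original embedding.
```

In a real GNN this aggregation would use learned weight matrices and nonlinearities and be stacked for several layers, but the core mechanism, propagating embeddings along cross-lingual edges, is the same.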