Cross-lingual Text Classification Transfer: The Case of Ukrainian

📅 2024-04-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Ukrainian, a low-resource East Slavic language, suffers from a severe scarcity of labeled data for text style classification, toxic speech detection, and natural language inference. Method: This work systematically investigates unsupervised cross-lingual knowledge transfer and proposes task-specific transfer "recipes", evaluating multilingual encoders (mBERT, XLM-R), zero-shot transfer with LLMs, back-translation with neural machine translation, and lightweight adapter-based fine-tuning. Contribution/Results: The work presents the first end-to-end, efficient cross-lingual transfer approach for low-resource East Slavic NLP tasks and reports significant improvements over baselines. It empirically validates hybrid strategies, in particular translation followed by fine-tuning combined with adapter-based adaptation, and offers them as a reusable setup. The framework provides both methodological foundations and practical implementation guidance for advancing NLP in under-resourced languages.

📝 Abstract

Despite the large number of labeled datasets in the NLP text classification field, a persistent imbalance in data availability across languages remains evident. To support the further equitable development of NLP models, exploring the possibilities of effective knowledge transfer to new languages is crucial. Ukrainian, in particular, is a language that can still benefit from the continued refinement of cross-lingual methodologies. To our knowledge, there is a tremendous lack of Ukrainian corpora for typical text classification tasks, i.e., different types of style, harmful speech, or relationships between texts. However, collecting such corpora from scratch understandably requires substantial resources. In this work, we leverage state-of-the-art advances in NLP, exploring cross-lingual knowledge transfer methods that avoid manual data curation: large multilingual encoders and translation systems, LLMs, and language adapters. We test the approaches on three text classification tasks (toxicity classification, formality classification, and natural language inference (NLI)), providing the "recipe" for the optimal setup for each task.
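To make the translation-based route concrete, the following is a minimal sketch of a "translate-train" data flow: labeled English examples are translated into Ukrainian, and a classifier is then trained on the translated text. It is an illustration under loud assumptions, not the setup used in the paper. The toy lexicon `EN_TO_UK`, the helper names (`toy_translate`, `translate_train_set`, `train_bow`, `predict`), the example sentences, and the bag-of-words scorer are all hypothetical stand-ins for a real translation system and a fine-tuned multilingual encoder.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Hypothetical toy lexicon standing in for a real MT system (assumption).
EN_TO_UK: Dict[str, str] = {
    "you": "ти",
    "are": "є",
    "stupid": "дурний",
    "idiot": "ідіот",
    "kind": "добрий",
    "thank": "дякую",
    "friend": "друг",
}

Example = Tuple[str, str]  # (text, label)


def toy_translate(text: str) -> str:
    """Word-by-word lookup; unknown words pass through unchanged."""
    return " ".join(EN_TO_UK.get(tok, tok) for tok in text.lower().split())


def translate_train_set(
    data: List[Example], translate: Callable[[str], str]
) -> List[Example]:
    """Translate only the text; labels carry over to the target language."""
    return [(translate(text), label) for text, label in data]


def train_bow(data: List[Example]) -> Dict[str, Counter]:
    """Count tokens per label (a stand-in for fine-tuning a real model)."""
    counts: Dict[str, Counter] = {}
    for text, label in data:
        counts.setdefault(label, Counter()).update(text.split())
    return counts


def predict(model: Dict[str, Counter], text: str) -> str:
    """Pick the label whose training tokens overlap most with the input."""
    tokens = text.lower().split()
    scores = {label: sum(c[t] for t in tokens) for label, c in model.items()}
    return max(scores, key=scores.get)


# Illustrative English training data (invented for this sketch).
EN_TRAIN: List[Example] = [
    ("you are stupid", "toxic"),
    ("idiot", "toxic"),
    ("you are kind", "neutral"),
    ("thank you friend", "neutral"),
]

uk_train = translate_train_set(EN_TRAIN, toy_translate)
model = train_bow(uk_train)
print(predict(model, "ти дурний ідіот"))
print(predict(model, "дякую друг"))
```

In a realistic pipeline, `toy_translate` would be replaced by a machine translation system and `train_bow` by fine-tuning a multilingual encoder; the point of the sketch is only that no Ukrainian example is labeled by hand.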

Problem

Research questions and friction points this paper is trying to address.

Addresses the scarcity of labeled Ukrainian data for text classification.

Explores cross-lingual knowledge transfer without manual curation.

Tests methods on toxicity, formality, and NLI tasks.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-lingual knowledge transfer methods

Multilingual encoders and translation systems

Language adapters for text classification
