Low-Resource, High-Impact: Building Corpora for Inclusive Language Technologies

📅 2025-12-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This tutorial addresses fairness and practicality in NLP for multilingual and low-resource languages, particularly underrepresented ones, where data is scarce and cultural contexts vary. It presents an end-to-end development workflow covering data collection and web crawling, parallel sentence mining, machine translation, and downstream applications such as text classification and multimodal reasoning. The material draws on use cases spanning over 10 languages from different language families and geopolitical contexts, including severely underrepresented ones, and emphasizes fair, reproducible, and community-informed development. The stated goal is to give practitioners a practical toolkit that lowers the barrier to building NLP applications for low-resource languages.

📝 Abstract

This tutorial (https://tum-nlp.github.io/low-resource-tutorial) is designed for NLP practitioners, researchers, and developers working with multilingual and low-resource languages who seek to create more equitable and socially impactful language technologies. Participants will walk away with a practical toolkit for building end-to-end NLP pipelines for underrepresented languages -- from data collection and web crawling to parallel sentence mining, machine translation, and downstream applications such as text classification and multimodal reasoning. The tutorial presents strategies for tackling the challenges of data scarcity and cultural variance, offering hands-on methods and modeling frameworks. We will focus on fair, reproducible, and community-informed development approaches, grounded in real-world scenarios. We will showcase a diverse set of use cases covering over 10 languages from different language families and geopolitical contexts, including both digitally resource-rich and severely underrepresented languages.
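The parallel sentence mining step named in the abstract can be illustrated with a minimal sketch. It is not the tutorial's own method: it assumes sentences have already been embedded by some multilingual encoder (not shown), and it keeps only mutual nearest neighbours whose cosine similarity clears a threshold. The function name `mine_parallel_pairs`, the threshold of 0.8, and the toy vectors are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mine_parallel_pairs(src_vecs, tgt_vecs, threshold=0.8):
    """Return (src_idx, tgt_idx, score) for mutual nearest neighbours.

    A pair is kept only if each sentence is the other's best match and
    the similarity is at least `threshold` (a hypothetical default).
    """
    if not src_vecs or not tgt_vecs:
        return []
    # Full similarity matrix: rows are source sentences, columns are targets.
    sims = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    # Best target for every source sentence, and best source for every target.
    best_tgt = [max(range(len(tgt_vecs)), key=lambda j: row[j]) for row in sims]
    best_src = [
        max(range(len(src_vecs)), key=lambda i: sims[i][j])
        for j in range(len(tgt_vecs))
    ]
    pairs = []
    for i, j in enumerate(best_tgt):
        if best_src[j] == i and sims[i][j] >= threshold:
            pairs.append((i, j, sims[i][j]))
    return pairs


# Toy embeddings standing in for encoder output (made-up numbers).
src = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.0], [0.5, 0.5, 0.5]]
tgt = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.1], [-1.0, 0.2, 0.0]]

pairs = mine_parallel_pairs(src, tgt)
for i, j, score in pairs:
    print(f"source {i} <-> target {j} (cosine {score:.3f})")
```

The mutual-nearest-neighbour check is a simple way to trade recall for precision, which tends to matter when candidate text is noisy web-crawled data. Mining systems used in practice often replace the raw cosine threshold with a margin-based score and an approximate nearest-neighbour index.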

Problem

Research questions and friction points this paper is trying to address.

Building NLP pipelines for underrepresented languages

Addressing data scarcity and cultural variance challenges

Focusing on fair, reproducible, community-informed development approaches

Innovation

Methods, ideas, or system contributions that make the work stand out.

Building end-to-end NLP pipelines for underrepresented languages

Tackling data scarcity with hands-on methods and modeling frameworks

Focusing on fair, reproducible, community-informed development approaches
