Low-Resource, High-Impact: Building Corpora for Inclusive Language Technologies

📅 2025-12-16
🤖 AI Summary
This tutorial addresses the dual challenges of fairness and practicality in NLP for multilingual and low-resource languages, where data is scarce and cultural contexts vary widely. It presents an end-to-end development workflow that covers community-informed data collection, web crawling, parallel sentence mining, machine translation, and downstream applications such as text classification and multimodal reasoning. The material consolidates use cases from more than ten languages across different language families and geopolitical contexts, including severely underrepresented ones, into a practical toolkit. The goal is to lower the technical barrier to NLP development for low-resource languages and to support fair, reproducible, and community-informed language technology.

📝 Abstract
This tutorial (https://tum-nlp.github.io/low-resource-tutorial) is designed for NLP practitioners, researchers, and developers working with multilingual and low-resource languages who seek to create more equitable and socially impactful language technologies. Participants will walk away with a practical toolkit for building end-to-end NLP pipelines for underrepresented languages -- from data collection and web crawling to parallel sentence mining, machine translation, and downstream applications such as text classification and multimodal reasoning. The tutorial presents strategies for tackling the challenges of data scarcity and cultural variance, offering hands-on methods and modeling frameworks. We will focus on fair, reproducible, and community-informed development approaches, grounded in real-world scenarios. We will showcase a diverse set of use cases covering over 10 languages from different language families and geopolitical contexts, including both digitally resource-rich and severely underrepresented languages.

Problem

Research questions and friction points this paper is trying to address.

Building NLP pipelines for underrepresented languages
Addressing data scarcity and cultural variance challenges
Focusing on fair, reproducible, community-informed development approaches

Innovation

Methods, ideas, or system contributions that make the work stand out.

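A practical toolkit for end-to-end NLP pipelines, from data collection and web crawling to parallel sentence mining, machine translation, and downstream applications
Hands-on methods and modeling frameworks for tackling data scarcity and cultural variance
Use cases covering over 10 languages from different language families and geopolitical contexts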