🤖 AI Summary
This study addresses a structural tension in low-resource natural language processing (NLP): technical capabilities are expanding rapidly while the linguistic expertise needed for robust evaluation remains scarce. The authors call this the “annotation scarcity paradox”: the technical capacity to scale models outpaces the human infrastructure required to evaluate them. Through a critical narrative review covering 2014 to the present, the paper examines validity threats and power imbalances inherent in current evaluation practices, and places evaluation sovereignty and knowledge governance at the core of methodological reform. It advocates shifting from extractive, transactional data paradigms toward community-embedded, relational evaluation frameworks. To this end, the work surveys emerging responses, including data augmentation, model-based evaluation, participatory curation, and annotation-efficient approaches based on item response theory and active learning, and assesses their trade-offs for equity and validity.
📝 Abstract
Over the past decade, low-resource natural language processing (NLP) has experienced explosive growth, propelled by cross-lingual transfer, massively multilingual models, and the rapid proliferation of benchmarks. Yet this apparent progress masks a critical, insufficiently examined tension: the deep sociolinguistic expertise required to evaluate increasingly complex generative systems is severely strained, inequitably distributed, and structurally marginalised. We present a critical narrative survey of low-resource NLP evaluation (2014 to the present), tracing its evolution across three phases: early heuristic optimism, the illusions of top-down benchmark scaling, and the current era of generative bottlenecks. We conceptualise the Annotation Scarcity Paradox, the structural friction arising when the technical capacity to scale models vastly outpaces the sovereign human infrastructure required to authentically evaluate them. By examining extractive data pipelines, undercompensated “ghost work”, and language data flaring, we argue that this paradox threatens the epistemic validity of reported progress. We survey emerging responses (including data augmentation, model-based evaluation, participatory curation, and annotation-efficient approaches via item response theory and active learning) and assess their equity and validity trade-offs. We close with a practitioner call to action, arguing that overcoming this bottleneck requires a paradigm shift from transactional data extraction to relational, community-embedded evaluation rooted in epistemic governance, data sovereignty, and shared ownership.