Stronger Than You Think: Benchmarking Weak Supervision on Realistic Tasks

📅 2025-01-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing weak supervision evaluation suites typically rely on simplified tasks and balanced label distributions, failing to reflect real-world challenges such as severe class imbalance, domain-expertise requirements, and multilingual parallel corpora. To address this gap, we propose BOXWRENCH, a benchmark and evaluation framework designed around realistic constraints: expert-crafted labeling functions (LFs), multilingual support, high label cardinality, and long-tailed class distributions. BOXWRENCH covers LF design, probabilistic label aggregation via label models (e.g., Snorkel, FlyingSquid), and end-to-end training pipelines, validated across multiple realistic NLP tasks. Empirically, weak supervision using only unlabeled data and noisy LF outputs matches, and sometimes surpasses, fully supervised models trained on 1,000+ high-quality annotations. Our key contributions are (1) the first deployment-oriented weak supervision benchmark reflecting production constraints, and (2) empirical evidence for the practical viability of weak supervision in low-resource, high-complexity settings.

📝 Abstract
Weak supervision (WS) is a popular approach for label-efficient learning, leveraging diverse sources of noisy but inexpensive weak labels to automatically annotate training data. Despite its wide usage, WS and its practical value are challenging to benchmark due to the many knobs in its setup, including: data sources, labeling functions (LFs), aggregation techniques (called label models), and end model pipelines. Existing evaluation suites tend to be limited, focusing on particular components or specialized use cases. Moreover, they often involve simplistic benchmark tasks or de facto LF sets that are suboptimally written, producing insights that may not generalize to real-world settings. We address these limitations by introducing a new benchmark, BOXWRENCH, designed to more accurately reflect real-world usages of WS. This benchmark features tasks with (1) higher class cardinality and imbalance, (2) notable domain expertise requirements, and (3) multilingual variations across parallel corpora. For all tasks, LFs are written using a careful procedure aimed at mimicking real-world settings. In contrast to existing WS benchmarks, we show that supervised learning requires substantial amounts (1000+) of labeled examples to match WS in many settings.
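The pipeline the abstract describes (labeling functions voting on unlabeled data, aggregated by a label model) can be sketched minimally as follows. This is a toy illustration under assumed names, not the paper's implementation: the LFs and the sentiment task are hypothetical, and majority vote stands in for the more sophisticated label models (e.g., Snorkel's, which weights LFs by estimated accuracy) that BOXWRENCH evaluates.

```python
# Toy weak supervision pipeline: labeling functions (LFs) emit noisy votes
# per example, and a simple label model (majority vote) aggregates them
# into training labels. All names here are illustrative.
from collections import Counter

ABSTAIN = -1  # an LF abstains when it has no signal for an example

# Hypothetical LFs for a toy sentiment task (labels: 0 = negative, 1 = positive)
def lf_contains_great(text):
    return 1 if "great" in text.lower() else ABSTAIN

def lf_contains_terrible(text):
    return 0 if "terrible" in text.lower() else ABSTAIN

def lf_exclamation(text):
    return 1 if text.endswith("!") else ABSTAIN

LFS = [lf_contains_great, lf_contains_terrible, lf_exclamation]

def majority_vote(text):
    """Aggregate LF votes; abstain when no LF fires or the vote is tied."""
    counts = Counter(v for v in (lf(text) for lf in LFS) if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    top = counts.most_common(2)
    if len(top) == 2 and top[0][1] == top[1][1]:
        return ABSTAIN  # tie between classes
    return top[0][0]

# "Annotate" unlabeled examples without any hand labels
unlabeled = ["A great movie!", "Terrible pacing.", "Just okay."]
labels = [majority_vote(t) for t in unlabeled]
print(labels)  # [1, 0, -1]
```

The aggregated labels (with abstentions filtered out) would then train an end model, which is the final component the benchmark's knobs cover.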
Problem

Research questions and friction points this paper is trying to address.

Weakly Supervised Learning
Evaluation Limitations
Real-world Application
Innovation

Methods, ideas, or system contributions that make the work stand out.

BOXWRENCH
Weakly Supervised Learning
Real-world Evaluation