ScriptoriumWS: A Code Generation Assistant for Weak Supervision

📅 2025-02-17

📈 Citations: 3

✨ Influential: 1

🤖 AI Summary

To address the high cost and low coverage of manually written, high-quality labeling functions (LFs) in weak supervision, this paper uses large language models (LLMs) as coding assistants and proposes a multi-level prompting strategy to automatically generate domain-adapted, noisy LFs. The method combines handcrafted and LLM-generated LFs, using weak supervision label aggregation (e.g., Snorkel) and a rule quality assessment mechanism to fuse them. Experiments across multiple real-world tasks show that the approach matches human-written LFs in pseudo-label accuracy while substantially improving LF coverage (+38.2% on average) and reducing expert effort (62% fewer manual rules). The core contribution is an LLM-augmented paradigm for weak supervision rule engineering: an interpretable, integrable generation-and-enhancement framework that bridges human expertise and generative AI.

📝 Abstract

Weak supervision is a popular framework for overcoming the labeled data bottleneck: the need to obtain labels for training data. In weak supervision, multiple noisy-but-cheap sources are used to provide guesses of the label and are aggregated to produce high-quality pseudolabels. These sources are often expressed as small programs written by domain experts -- and so are expensive to obtain. Instead, we argue for using code-generation models to act as coding assistants for crafting weak supervision sources. We study prompting strategies to maximize the quality of the generated sources, settling on a multi-tier strategy that incorporates multiple types of information. We explore how to best combine hand-written and generated sources. Using these insights, we introduce ScriptoriumWS, a weak supervision system that, when compared to hand-crafted sources, maintains accuracy and greatly improves coverage.
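The idea of sources as "small programs" whose noisy guesses are aggregated can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the spam/ham task, the three labeling functions, and the plain majority vote are hypothetical stand-ins, not the sources or the aggregation model used by ScriptoriumWS (weak supervision systems typically use a learned label model rather than a simple vote).

```python
from collections import Counter

# Hypothetical label space for a toy spam task (not from the paper).
ABSTAIN, HAM, SPAM = -1, 0, 1


def lf_contains_link(text):
    """Vote SPAM if the text contains a URL-like token, else abstain."""
    return SPAM if "http" in text.lower() else ABSTAIN


def lf_mentions_prize(text):
    """Vote SPAM if the text mentions a prize, else abstain."""
    return SPAM if "prize" in text.lower() else ABSTAIN


def lf_short_message(text):
    """Vote HAM for very short messages, else abstain."""
    return HAM if len(text.split()) <= 3 else ABSTAIN


def majority_vote(votes):
    """Aggregate one example's votes; abstain on no votes or a tie."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


def pseudolabel(texts, lfs):
    """Apply every labeling function to every text and aggregate the votes."""
    return [majority_vote([lf(t) for lf in lfs]) for t in texts]


if __name__ == "__main__":
    lfs = [lf_contains_link, lf_mentions_prize, lf_short_message]
    texts = [
        "Claim your prize at http://x.example",
        "hi there",
        "The meeting moved to Tuesday afternoon",
    ]
    print(pseudolabel(texts, lfs))
```

Each function either votes or abstains, which is why coverage matters: the third text receives no vote at all, so it gets no pseudolabel.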

Problem

Research questions and friction points this paper is trying to address.

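Obtaining labels for training data is a bottleneck that weak supervision aims to overcome.

Weak supervision sources are small programs written by domain experts, so they are expensive to obtain; code-generation models are proposed to help create them.

Hand-written and generated sources need to be combined effectively.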

Innovation

Methods, ideas, or system contributions that make the work stand out.

Code-generation models assist weak supervision

Multi-tier prompting enhances source quality

Combines hand-written and generated sources effectively
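One way to picture combining hand-written and generated sources is sketched below. This is an illustration under stated assumptions, not the combination strategy studied in the paper: the filtering of generated sources by accuracy on a small labeled development set, the `min_accuracy` threshold, and all function names are hypothetical. Generated sources are assumed to already be available as Python callables that return a label or `ABSTAIN`.

```python
ABSTAIN = -1  # hypothetical abstain value shared by all sources


def coverage(lfs, texts):
    """Fraction of texts on which at least one source does not abstain."""
    if not texts:
        return 0.0
    covered = sum(any(lf(t) != ABSTAIN for lf in lfs) for t in texts)
    return covered / len(texts)


def accuracy_on_dev(lf, dev_texts, dev_labels):
    """Accuracy over non-abstained dev examples; None if the source never votes."""
    voted = [
        (pred, gold)
        for pred, gold in ((lf(t), y) for t, y in zip(dev_texts, dev_labels))
        if pred != ABSTAIN
    ]
    if not voted:
        return None
    return sum(pred == gold for pred, gold in voted) / len(voted)


def combine_sources(hand_written, generated, dev_texts, dev_labels, min_accuracy=0.7):
    """Keep all hand-written sources; keep generated ones that pass the dev check.

    The threshold-based filter is an assumed quality check for illustration.
    """
    kept = []
    for lf in generated:
        acc = accuracy_on_dev(lf, dev_texts, dev_labels)
        if acc is not None and acc >= min_accuracy:
            kept.append(lf)
    return list(hand_written) + kept


if __name__ == "__main__":
    hand = [lambda t: 1 if "prize" in t else ABSTAIN]
    generated = [
        lambda t: 0 if len(t.split()) <= 3 else ABSTAIN,  # narrow but accurate
        lambda t: 1,                                      # votes on everything, mostly wrong
    ]
    dev_texts = ["win a prize now please", "hi there", "see you at lunch tomorrow", "ok"]
    dev_labels = [1, 0, 0, 0]
    combined = combine_sources(hand, generated, dev_texts, dev_labels)
    print(len(combined), coverage(hand, dev_texts), coverage(combined, dev_texts))
```

In this toy setup the generated source that survives the filter raises coverage without touching the hand-written source, which mirrors the accuracy-versus-coverage trade-off the summary describes.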
