Prompting in the Dark: Assessing Human Performance in Prompt Engineering for Data Labeling When Gold Labels Are Absent

📅 2025-02-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Evaluating human prompt engineering efficacy in the absence of gold-standard labels remains an open challenge. Method: We propose "prompting in the dark," a paradigm for studying unsupervised prompt engineering, implemented via PromptingSheet, a Google Sheets add-on that enables structured, reproducible, iterative LLM data annotation experiments. We conduct a controlled human-subject study (N=20) and benchmark against automated prompt optimization tools (e.g., DSPy). Contribution/Results: Only 45% of participants (9 of 20) improved annotation accuracy after four or more prompt iterations, and automated prompt optimization methods also struggled when few or no gold labels were available. This work provides systematic empirical evidence of human unreliability in unsupervised prompt engineering and underscores the central role of gold-standard labels in validating prompt effectiveness. It contributes both empirical findings and practical tooling, PromptingSheet, for reproducible prompt evaluation.

📝 Abstract
Millions of users prompt large language models (LLMs) for various tasks, but how good are people at prompt engineering? Do users actually get closer to their desired outcome over multiple iterations of their prompts? These questions are crucial when no gold-standard labels are available to measure progress. This paper investigates a scenario in LLM-powered data labeling, "prompting in the dark," where users iteratively prompt LLMs to label data without using manually-labeled benchmarks. We developed PromptingSheet, a Google Sheets add-on that enables users to compose, revise, and iteratively label data through spreadsheets. Through a study with 20 participants, we found that prompting in the dark was highly unreliable: only 9 participants improved labeling accuracy after four or more iterations. Automated prompt optimization tools like DSPy also struggled when few gold labels were available. Our findings highlight the importance of gold labels and the need for, as well as the risks of, automated support in human prompt engineering, providing insights for future tool design.
Problem

Research questions and friction points this paper is trying to address.

Assessing human performance in prompt engineering
Evaluating iterative prompt improvements without gold labels
Investigating reliability of LLM-powered data labeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Iterative prompt engineering without benchmarks
Google Sheets add-on for data labeling
Automated tools struggle without gold labels