ElicitationGPT: Text Elicitation Mechanisms via Language Models

📅 2024-06-13
🏛️ arXiv.org
📈 Citations: 8
Influential: 0

🤖 AI Summary
This work addresses the problem of scoring elicited text against ground-truth text. Methodologically, it reduces textual information elicitation to forecast elicitation, so that text can be evaluated with the scoring rules used for probabilistic forecasts. The reduction relies only on domain-knowledge-free API queries to a black-box LLM (specifically ChatGPT), and the theoretical analysis establishes the properness of the resulting scoring mechanism. Experiments on peer reviews from a peer-grading dataset report strong agreement with manual instructor scores (Spearman’s ρ > 0.85), significantly outperforming conventional baselines. The approach requires neither domain-specific knowledge nor fine-tuning, and the framework offers both theoretical guarantees and a practical tool for human-aligned text evaluation.

📝 Abstract
Scoring rules evaluate probabilistic forecasts of an unknown state against the realized state and are a fundamental building block in the incentivized elicitation of information. This paper develops mechanisms for scoring elicited text against ground truth text by reducing the textual information elicitation problem to a forecast elicitation problem, via domain-knowledge-free queries to a large language model (specifically ChatGPT), and empirically evaluates their alignment with human preferences. Our theoretical analysis shows that the reduction achieves provable properness via black-box language models. The empirical evaluation is conducted on peer reviews from a peer-grading dataset, in comparison to manual instructor scores for the peer reviews. Our results suggest a paradigm of algorithmic artificial intelligence that may be useful for developing artificial intelligence technologies with provable guarantees.
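
As a minimal illustration of what "properness" means for the scoring rules mentioned above, the sketch below uses the quadratic (Brier-style) rule for a binary state. The choice of rule and the function names are assumptions made for illustration; the abstract does not say which scoring rule the paper uses. A rule is proper when a forecaster maximizes expected score by reporting their true belief.

```python
def quadratic_score(forecast: float, state: int) -> float:
    """Quadratic score for a binary state (0 or 1), scaled to lie in [0, 1]."""
    return 1.0 - (forecast - state) ** 2


def expected_score(report: float, belief: float) -> float:
    """Expected score of `report` when the state is 1 with probability `belief`."""
    return (belief * quadratic_score(report, 1)
            + (1.0 - belief) * quadratic_score(report, 0))


def best_report(belief: float, grid_size: int = 101) -> float:
    """Search a grid of candidate reports for the one with highest expected score."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(grid, key=lambda r: expected_score(r, belief))


if __name__ == "__main__":
    # With a true belief of 0.7, the grid search should pick a report of 0.7.
    print(best_report(0.7))
```

The expected score here equals 1 - b(1 - r)^2 - (1 - b)r^2 for belief b and report r, which is maximized at r = b, so truthful reporting is optimal under this rule.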

Problem

Research questions and friction points this paper is trying to address.

Scoring elicited text against ground truth using language models
Reducing textual information elicitation to forecast elicitation problems
Evaluating mechanism alignment with human preferences empirically

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reduces text elicitation to forecast scoring
Uses black-box language models without domain knowledge
Achieves provable properness via theoretical guarantees
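
The reduction from text scoring to forecast scoring can be pictured with a toy sketch. Everything below is hypothetical and is not the paper's mechanism: the `keyword_oracle` function is a keyword-matching stand-in for a black-box language-model query, the topic list is supplied by hand, and the quadratic rule is an assumed choice of proper scoring rule. The idea shown is only the general shape: turn the report into a forecast per topic, turn the ground truth into a realized state per topic, and score one against the other.

```python
from typing import Callable, List

# Hypothetical oracle type: (text, topic) -> probability that the text raises the topic.
Oracle = Callable[[str, str], float]


def keyword_oracle(text: str, topic: str) -> float:
    """Stand-in for a language-model query (illustrative assumption only)."""
    return 0.9 if topic.lower() in text.lower() else 0.1


def score_text(report: str, ground_truth: str, topics: List[str],
               oracle: Oracle = keyword_oracle) -> float:
    """Average quadratic score of the report's forecasts against ground-truth states."""
    if not topics:
        raise ValueError("at least one topic is required")
    total = 0.0
    for topic in topics:
        forecast = oracle(report, topic)                       # forecast from the report
        state = 1 if oracle(ground_truth, topic) >= 0.5 else 0  # realized state
        total += 1.0 - (forecast - state) ** 2
    return total / len(topics)


if __name__ == "__main__":
    topics = ["proof", "notation"]
    truth = "The proof has a gap; the notation is fine."
    print(score_text("The proof skips a step and the notation is clear.", truth, topics))
    print(score_text("Nice work.", truth, topics))
```

In this toy setup a report that touches the same topics as the ground truth scores higher than an uninformative one. A real mechanism would replace the stub with actual model queries and would need the theoretical analysis described in the abstract to justify any properness claim.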