ElicitationGPT: Text Elicitation Mechanisms via Language Models

📅 2024-06-13

🏛️ arXiv.org

📈 Citations: 8

✨ Influential: 0

🤖 AI Summary

This work addresses the problem of automatically scoring elicited text against ground-truth text. It reduces textual elicitation to forecast elicitation: domain-knowledge-free queries to a black-box LLM (specifically ChatGPT) turn the comparison of elicited and ground-truth text into the scoring of a forecast against a realized state, so that a proper scoring rule can be applied. The theoretical analysis establishes that this reduction is provably proper. Experiments on peer reviews from a peer-grading dataset compare the resulting scores with manual instructor scores and report strong agreement (Spearman’s ρ > 0.85), outperforming conventional baselines. The approach requires no domain-specific knowledge or fine-tuning, and the authors suggest it as a paradigm for building AI technologies with provable guarantees.

📝 Abstract

Scoring rules evaluate probabilistic forecasts of an unknown state against the realized state and are a fundamental building block in the incentivized elicitation of information. This paper develops mechanisms for scoring elicited text against ground truth text by reducing the textual information elicitation problem to a forecast elicitation problem, via domain-knowledge-free queries to a large language model (specifically ChatGPT), and empirically evaluates their alignment with human preferences. Our theoretical analysis shows that the reduction achieves provable properness via black-box language models. The empirical evaluation is conducted on peer reviews from a peer-grading dataset, in comparison to manual instructor scores for the peer reviews. Our results suggest a paradigm of algorithmic artificial intelligence that may be useful for developing artificial intelligence technologies with provable guarantees.
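To make the notion of a proper scoring rule concrete, the following minimal sketch uses the standard quadratic (Brier-style) rule for a binary state and checks numerically that reporting one's true belief maximizes expected score. The function names and the grid search are illustrative assumptions; this is not the scoring rule or code used in the paper.

```python
def quadratic_score(forecast, state):
    """Quadratic (Brier-style) score for a binary state; higher is better.

    forecast: reported probability that the state equals 1
    state:    realized state, 0 or 1
    """
    return 1.0 - (state - forecast) ** 2


def expected_score(report, belief):
    """Expected score of reporting `report` when the state is 1 with probability `belief`."""
    return belief * quadratic_score(report, 1) + (1.0 - belief) * quadratic_score(report, 0)


def best_report(belief, grid_size=101):
    """Grid-search the report in [0, 1] that maximizes expected score under `belief`."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(grid, key=lambda r: expected_score(r, belief))


if __name__ == "__main__":
    belief = 0.7
    # Properness: the maximizing report coincides with the true belief.
    print(best_report(belief))
    # Exaggerating to 0.9 lowers the expected score.
    print(expected_score(0.7, belief), expected_score(0.9, belief))
```

Properness is what makes such a rule suitable for incentivized elicitation: a forecaster who wants a high expected score has no reason to misreport.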

Problem

Research questions and friction points this paper is trying to address.

Scoring elicited text against ground truth using language models

Reducing textual information elicitation to forecast elicitation problems

Evaluating mechanism alignment with human preferences empirically

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reduces text elicitation to forecast scoring

Uses black-box language models without domain knowledge

Achieves provable properness via theoretical guarantees
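The reduction listed above can be pictured with a toy sketch: treat each topic as a binary state read off the ground-truth text, treat the elicited text as implying a forecast for that topic, and average a proper score across topics. Everything here is a hypothetical illustration. The topic decomposition, the `Oracle` type, and `keyword_oracle` (a keyword-matching stand-in for a language-model query) are assumptions made for the example, not the paper's actual mechanism.

```python
from typing import Callable, List

# Hypothetical oracle: given a text and a topic, return a probability in [0, 1]
# that the text asserts the topic. A real system would query a language model.
Oracle = Callable[[str, str], float]


def keyword_oracle(text: str, topic: str) -> float:
    """Toy stand-in for an LLM query: confident only if the topic word appears."""
    return 0.9 if topic.lower() in text.lower() else 0.1


def quadratic_score(forecast: float, state: int) -> float:
    """Quadratic proper scoring rule for a binary state; higher is better."""
    return 1.0 - (state - forecast) ** 2


def score_text(elicited: str, ground_truth: str, topics: List[str],
               oracle: Oracle = keyword_oracle) -> float:
    """Average per-topic quadratic score of the forecasts implied by `elicited`."""
    if not topics:
        raise ValueError("at least one topic is required")
    total = 0.0
    for topic in topics:
        # Realized state: does the ground-truth text assert this topic?
        state = 1 if oracle(ground_truth, topic) >= 0.5 else 0
        # Forecast: how strongly does the elicited text assert it?
        forecast = oracle(elicited, topic)
        total += quadratic_score(forecast, state)
    return total / len(topics)


if __name__ == "__main__":
    truth = "The proof has a gap, and clarity is poor."
    topics = ["proof", "clarity"]
    print(score_text("Clarity is lacking; the proof is incomplete.", truth, topics))
    print(score_text("Nice work overall.", truth, topics))
```

Swapping `keyword_oracle` for a function that calls a language model keeps the scoring logic unchanged, which is the sense in which the model is used as a black box.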
