Real-Time Trustworthiness Scoring for LLM Structured Outputs and Data Extraction

📅 2026-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of sporadic errors in structured outputs generated by large language models (LLMs), which hinder their reliable deployment in enterprise settings. The authors propose CONSTRUCT, a method that estimates real-time confidence scores based on output uncertainty, enabling assessment of both overall and field-level reliability for any LLM (including black-box APIs without access to log probabilities) without requiring labeled data or model customization. CONSTRUCT supports heterogeneous fields and nested JSON structures. The study also introduces one of the first public benchmarks for structured generation with reliable ground-truth annotations. Evaluated across four datasets involving models such as Gemini 3 and GPT-5, CONSTRUCT significantly outperforms existing approaches in precision and recall for error detection.
📝 Abstract
Structured Outputs from current LLMs exhibit sporadic errors, hindering enterprise AI deployment. We present CONSTRUCT, a real-time uncertainty estimator that scores the trustworthiness of LLM Structured Outputs. Lower-scoring outputs are more likely to contain errors, enabling automatic prioritization of limited human review bandwidth. CONSTRUCT additionally scores the trustworthiness of each field within a Structured Output, helping reviewers quickly identify which parts of the output are incorrect. Our method is suitable for any LLM (including black-box LLM APIs without logprobs), does not require labeled training data or custom model deployment, and supports complex Structured Outputs with heterogeneous fields and nested JSON schemas. We also introduce one of the first public LLM Structured Output benchmarks with reliable ground-truth values. Over this four-dataset benchmark, CONSTRUCT detects errors in outputs from various LLMs (including Gemini 3 and GPT-5) with significantly higher precision/recall than existing techniques.
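The abstract does not spell out CONSTRUCT's algorithm, but its stated constraints (any black-box LLM, no logprobs, no labeled data, nested JSON support) are satisfied by a well-known family of techniques: self-consistency scoring, where the extraction is resampled several times and each field is scored by how often the resamples agree with the primary output. The sketch below is an illustration of that general idea under those assumptions, not CONSTRUCT itself; all function names are hypothetical.

```python
def flatten(obj, prefix=""):
    """Flatten nested JSON into {dotted.path: leaf_value} pairs,
    so fields at any depth can be compared across samples."""
    out = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            out.update(flatten(value, prefix + key + "."))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            out.update(flatten(value, prefix + str(i) + "."))
    else:
        out[prefix.rstrip(".")] = obj
    return out


def field_trust_scores(primary, samples):
    """Score each field of `primary` by its agreement rate with
    resampled outputs; return (overall_score, per_field_scores).
    Lower scores flag fields more likely to contain errors."""
    reference = flatten(primary)
    flattened_samples = [flatten(s) for s in samples]
    per_field = {}
    for path, value in reference.items():
        agree = sum(1 for f in flattened_samples if f.get(path) == value)
        per_field[path] = agree / len(samples)
    overall = sum(per_field.values()) / len(per_field)
    return overall, per_field


# Example: one primary extraction plus three resamples of the same prompt.
primary = {"name": "Acme", "address": {"city": "Paris"}}
samples = [
    {"name": "Acme", "address": {"city": "Paris"}},
    {"name": "Acme", "address": {"city": "Paris"}},
    {"name": "Acme", "address": {"city": "Lyon"}},
]
overall, per_field = field_trust_scores(primary, samples)
# per_field["name"] == 1.0; per_field["address.city"] == 2/3
```

In a real deployment the `samples` would come from repeated API calls at nonzero temperature; outputs whose overall score falls below a threshold would be routed to human review, with the low-scoring fields highlighted for the reviewer.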
Problem

Research questions and friction points this paper is trying to address.

LLM Structured Outputs
trustworthiness scoring
real-time uncertainty estimation
error detection
enterprise AI deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

trustworthiness scoring
structured output
uncertainty estimation
LLM error detection
real-time evaluation