LLM Performance Predictors: Learning When to Escalate in Hybrid Human-AI Moderation Systems

📅 2026-01-11
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of assessing the reliability of large language model (LLM) outputs in human-AI collaborative content moderation, where uncertain predictions complicate human escalation decisions. The authors propose a supervised uncertainty quantification framework based on an LLM Performance Predictor (LPP), which leverages log probabilities, information entropy, and a novel uncertainty attribution metric to construct a lightweight meta-model. This meta-model automatically identifies high-risk cases and triggers human review. Notably, this is the first application of LPP to uncertainty estimation in multimodal and multilingual settings. Evaluated across mainstream models—including Gemini, GPT, Llama, and Qwen—the approach significantly outperforms existing methods, achieving a better trade-off between moderation accuracy and human labor cost while providing interpretable attributions for model failures.
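
The meta-model described above can be approximated by a small supervised classifier over uncertainty features derived from the LLM's output. The sketch below is a minimal illustration under assumed inputs: the feature set, the synthetic training labels, and the logistic-regression choice are placeholders, not the paper's actual LPP implementation.

```python
# Minimal sketch of an LPP-style meta-model, assuming per-token log-probabilities
# are available from the moderation LLM. Features, data, and classifier are
# illustrative assumptions, not the authors' exact method.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def lpp_features(token_logprobs: np.ndarray) -> np.ndarray:
    """Summarize per-token log-probabilities into a small LPP feature vector."""
    probs = np.exp(token_logprobs)
    return np.array([
        token_logprobs.mean(),             # average token confidence
        token_logprobs.min(),              # worst-case token confidence
        -(probs * token_logprobs).mean(),  # crude entropy-style dispersion proxy
    ])

# Synthetic stand-in for a labeled moderation set: confident answers are usually
# correct, hesitant ones are more often wrong.
n = 500
confidence = rng.uniform(0.2, 1.0, size=n)
samples = [np.log(np.clip(rng.normal(c, 0.1, size=30), 1e-3, 1.0)) for c in confidence]
llm_was_wrong = (rng.uniform(size=n) > confidence).astype(int)

X = np.vstack([lpp_features(s) for s in samples])
meta_model = LogisticRegression(class_weight="balanced").fit(X, llm_was_wrong)

def should_escalate(token_logprobs: np.ndarray, risk_threshold: float = 0.3) -> bool:
    """Route a moderation case to human review when predicted failure risk is high."""
    risk = meta_model.predict_proba(lpp_features(token_logprobs).reshape(1, -1))[0, 1]
    return bool(risk >= risk_threshold)

print(should_escalate(np.log(np.full(30, 0.95))))  # confident answer, likely automated
print(should_escalate(np.log(np.full(30, 0.35))))  # hesitant answer, likely escalated
```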

📝 Abstract
As LLMs are increasingly integrated into human-in-the-loop content moderation systems, a central challenge is deciding when their outputs can be trusted versus when escalation for human review is preferable. We propose a novel framework for supervised LLM uncertainty quantification, learning a dedicated meta-model based on LLM Performance Predictors (LPPs) derived from LLM outputs: log-probabilities, entropy, and novel uncertainty attribution indicators. We demonstrate that our method enables cost-aware selective classification in real-world human-AI workflows: escalating high-risk cases while automating the rest. Experiments across state-of-the-art LLMs, including both off-the-shelf (Gemini, GPT) and open-source (Llama, Qwen), on multimodal and multilingual moderation tasks, show significant improvements over existing uncertainty estimators in accuracy-cost trade-offs. Beyond uncertainty estimation, the LPPs enhance explainability by providing new insights into failure conditions (e.g., ambiguous content vs. under-specified policy). This work establishes a principled framework for uncertainty-aware, scalable, and responsible human-AI moderation workflows.
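The cost-aware selective classification described in the abstract can be pictured as a threshold sweep over predicted failure risk: escalate an item when its predicted risk justifies the cost of a human review. In the sketch below, the per-item review and error costs and the toy validation data are assumptions for illustration, not the paper's experimental setup.

```python
# Hypothetical sketch of cost-aware escalation, assuming risk scores from an
# LPP-style meta-model and fixed per-item costs (both are assumptions).
import numpy as np

def pick_escalation_threshold(risk, was_wrong, review_cost=1.0, error_cost=5.0):
    """Sweep risk thresholds and pick the one minimizing expected cost per item.

    risk:      predicted probability that the LLM decision is wrong
    was_wrong: 1 if the LLM decision was actually wrong (validation labels)
    """
    risk = np.asarray(risk, dtype=float)
    was_wrong = np.asarray(was_wrong, dtype=int)
    best_t, best_cost = 1.0, np.inf
    for t in np.linspace(0.0, 1.0, 101):
        escalate = risk >= t
        # Escalated items pay a fixed human-review cost; automated items pay the
        # error cost whenever the LLM decision was actually wrong.
        cost = (escalate.sum() * review_cost
                + was_wrong[~escalate].sum() * error_cost) / len(risk)
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

# Toy validation set: risk scores loosely aligned with the actual failures.
rng = np.random.default_rng(1)
risk = rng.uniform(size=1000)
was_wrong = (rng.uniform(size=1000) < risk).astype(int)
threshold, cost = pick_escalation_threshold(risk, was_wrong)
print(f"escalate when risk >= {threshold:.2f}; expected cost per item ~ {cost:.2f}")
```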
Problem

Research questions and friction points this paper is trying to address.

LLM uncertainty
human-AI moderation
trustworthiness
escalation decision
content moderation
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM Performance Predictors
uncertainty quantification
human-AI moderation
selective classification
explainability
Or Bachar
Zefr, Los Angeles, United States
Or Levi
Zefr, Los Angeles, United States
Sardhendu Mishra
Zefr, Los Angeles, United States
Adi Levi
Zefr, Los Angeles, United States
Manpreet Singh Minhas
Zefr, Los Angeles, United States
Justin Miller
MIT, Ford Motor Company
Robotics, Artificial Intelligence, Machine Learning
Omer Ben-Porat
Assistant Professor, Technion—Israel Institute of Technology
Economics and Computation, Multi-Agent Systems, Machine Learning, Data Science
Eilon Sheetrit
Reichman University, Herzliya, Israel
Jonathan Morra
Zefr, Los Angeles, United States