🤖 AI Summary
This work addresses the challenge of assessing the reliability of large language model (LLM) outputs in human-AI collaborative content moderation, where uncertain predictions complicate escalation decisions. The authors propose a supervised uncertainty quantification framework built on LLM Performance Predictors (LPPs): signals derived from LLM outputs, namely log probabilities, information entropy, and a novel uncertainty attribution metric, that feed a lightweight meta-model. This meta-model automatically identifies high-risk cases and triggers human review. Notably, this is the first application of LPPs to uncertainty estimation in multimodal and multilingual settings. Evaluated across mainstream models, including Gemini, GPT, Llama, and Qwen, the approach significantly outperforms existing methods, achieving a better trade-off between moderation accuracy and human labor cost while providing interpretable attributions for model failures.
📝 Abstract
As LLMs are increasingly integrated into human-in-the-loop content moderation systems, a central challenge is deciding when their outputs can be trusted versus when escalation for human review is preferable. We propose a novel framework for supervised LLM uncertainty quantification, learning a dedicated meta-model based on LLM Performance Predictors (LPPs) derived from LLM outputs: log-probabilities, entropy, and novel uncertainty attribution indicators. We demonstrate that our method enables cost-aware selective classification in real-world human-AI workflows: escalating high-risk cases while automating the rest. Experiments across state-of-the-art LLMs, including both off-the-shelf (Gemini, GPT) and open-source (Llama, Qwen), on multimodal and multilingual moderation tasks, show significant improvements over existing uncertainty estimators in accuracy-cost trade-offs. Beyond uncertainty estimation, the LPPs enhance explainability by providing new insights into failure conditions (e.g., ambiguous content vs. under-specified policy). This work establishes a principled framework for uncertainty-aware, scalable, and responsible human-AI moderation workflows.
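The selective-classification recipe the abstract describes (extract uncertainty signals from LLM outputs, train a lightweight meta-model on them, escalate when predicted risk crosses a threshold) can be sketched as below. This is a minimal, hypothetical illustration, not the paper's implementation: it assumes per-token log-probabilities are available from the LLM, uses an entropy proxy computed only over the sampled tokens, stands in a plain logistic regression for the meta-model, and omits the uncertainty attribution indicators, whose definition is not given here.

```python
import numpy as np

def extract_lpp_features(token_logprobs):
    """Illustrative per-example features from LLM token log-probabilities:
    mean log-prob, min log-prob, and an entropy proxy. (Hypothetical
    feature set; the paper's LPPs also include attribution indicators.)"""
    lp = np.asarray(token_logprobs, dtype=float)
    probs = np.exp(lp)
    # Entropy proxy over the sampled tokens only (-p * log p), not the
    # full vocabulary distribution -- an illustrative simplification.
    entropy = float(np.mean(-probs * lp))
    return np.array([lp.mean(), lp.min(), entropy])

def train_meta_model(X, y, lr=0.5, epochs=2000):
    """Tiny logistic-regression meta-model predicting P(LLM error),
    fit by full-batch gradient descent. Labels y come from supervision:
    1 where the LLM's moderation decision was wrong, 0 where correct."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def escalate(w, features, threshold=0.5):
    """Cost-aware routing: escalate to human review when predicted
    error risk exceeds the threshold; otherwise automate."""
    x = np.append(features, 1.0)
    risk = float(1.0 / (1.0 + np.exp(-x @ w)))
    return risk >= threshold, risk
```

In a real workflow the threshold would be tuned on held-out data to trade moderation accuracy against human labor cost, which is the accuracy-cost trade-off the abstract evaluates.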