🤖 AI Summary
This study addresses the lack of systematic analysis of the reliability and bias of large language models (LLMs) used as judges of reasoning quality in payment-risk settings. It proposes a multi-evaluator framework for Merchant Category Code (MCC)-based merchant risk assessment that combines a five-criterion scoring rubric, Monte Carlo scoring, and a non-circular consensus-deviation metric, validated against expert judgments and real-world payment-network data. The study quantifies self-evaluation bias in LLM judges: GPT-5.1 and Claude 4.5 Sonnet show negative bias (−0.33 and −0.31), while Gemini-2.5 Pro and Grok 4 show positive bias (+0.77 and +0.71), and bias attenuates by 25.8% under anonymization. Four models show statistically significant alignment with ground-truth payment-network data (Spearman’s ρ = 0.56–0.77), and the negative bias of GPT-5.1 and Claude 4.5 Sonnet reflects closer alignment with human expert judgment. The work offers a replicable basis for evaluating LLM-as-a-judge systems in payment-risk workflows.
📝 Abstract
Large Language Models (LLMs) are increasingly used as evaluators of reasoning quality, yet their reliability and bias in payment-risk settings remain poorly understood. We introduce a structured multi-evaluator framework for assessing LLM reasoning in Merchant Category Code (MCC)-based merchant risk assessment, combining a five-criterion rubric with Monte Carlo scoring to evaluate rationale quality and evaluator stability. Five frontier LLMs generate and cross-evaluate MCC risk rationales under attributed and anonymized conditions. To establish a judge-independent reference, we introduce a consensus-deviation metric that eliminates circularity by comparing each judge's score to the mean of all other judges' scores, yielding a theoretically grounded measure of self-evaluation and cross-model deviation. Results reveal substantial heterogeneity: GPT-5.1 and Claude 4.5 Sonnet show negative self-evaluation bias (−0.33 and −0.31), while Gemini-2.5 Pro and Grok 4 display positive bias (+0.77 and +0.71), with bias attenuating by 25.8% under anonymization. Evaluation by 26 payment-industry experts shows that LLM judges assign scores averaging +0.46 points above human consensus, and that the negative bias of GPT-5.1 and Claude 4.5 Sonnet reflects closer alignment with human judgment. Ground-truth validation using payment-network data shows that four models exhibit statistically significant alignment (Spearman's ρ = 0.56 to 0.77), confirming that the framework captures genuine quality. Overall, the framework provides a replicable basis for evaluating LLM-as-a-judge systems in payment-risk workflows and highlights the need for bias-aware protocols in operational financial settings.
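The consensus-deviation idea described above, comparing each judge's score with the mean of all other judges' scores, can be illustrated with a short Python sketch. This is one plausible reading of the abstract, not the authors' implementation. The function names, the `scores[judge][author]` layout, the assumption that each entry is already averaged over Monte Carlo runs, and the example values are all hypothetical.

```python
from statistics import mean


def consensus_deviation(scores):
    """Leave-one-out deviation of each judge from the other judges.

    scores[judge][author] is assumed to hold the judge's rubric score for the
    rationale written by `author`, already averaged over repeated
    (Monte Carlo) scoring runs. This data layout is an assumption.
    """
    judges = list(scores)
    deviation = {}
    for judge in judges:
        deviation[judge] = {}
        for author in scores[judge]:
            # The reference excludes the judge's own score, so the judge
            # cannot pull the consensus toward itself.
            others = [scores[k][author] for k in judges if k != judge]
            deviation[judge][author] = scores[judge][author] - mean(others)
    return deviation


def self_evaluation_bias(scores):
    """Deviation of each model when it judges its own rationale."""
    dev = consensus_deviation(scores)
    return {model: dev[model][model] for model in scores}


# Hypothetical scores for three placeholder models (not data from the paper).
example_scores = {
    "model_a": {"model_a": 4.0, "model_b": 3.0, "model_c": 3.5},
    "model_b": {"model_a": 3.0, "model_b": 3.5, "model_c": 3.5},
    "model_c": {"model_a": 3.5, "model_b": 3.0, "model_c": 3.0},
}

if __name__ == "__main__":
    for model, bias in self_evaluation_bias(example_scores).items():
        print(f"{model}: {bias:+.2f}")
```

Under this reading, a positive value means a model rates its own rationale above what the other judges give it, and a negative value means it rates itself below them. Because the reference mean never includes the judge being measured, the comparison avoids the circularity the abstract refers to.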