Textual Entailment and Token Probability as Bias Evaluation Metrics

📅 2025-10-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the validity of evaluation methods for social bias in language models, systematically comparing two mainstream metrics: natural language inference (NLI)-based and token probability (TP)-based bias scoring. Through multi-dimensional empirical analysis, we find that the two metrics exhibit negligible correlation in bias scores (Spearman’s ρ < 0.1), indicating fundamentally distinct bias-capture mechanisms: NLI is more sensitive to insufficient debiasing but vulnerable to counter-stereotypical phrasing, whereas TP demonstrates greater stability yet lacks semantic plausibility assessment. To reconcile these trade-offs, we propose a three-tier complementary evaluation framework—TP + NLI + downstream task performance—that substantially enhances both robustness and interpretability of bias detection. This study provides the first quantitative evidence of metric non-concordance and establishes a theoretical foundation and practical roadmap for advancing bias evaluation from single-metric paradigms toward multi-dimensional, synergistic assessment.

📝 Abstract
Measurement of social bias in language models is typically done with token probability (TP) metrics, which are broadly applicable but have been criticized for their distance from real-world language model use cases and harms. In this work, we test natural language inference (NLI) as a more realistic alternative bias metric. We show that, curiously, NLI and TP bias evaluation behave substantially differently, with very low correlation among different NLI metrics and between NLI and TP metrics. We find that NLI metrics are more likely to detect "underdebiased" cases. However, NLI metrics seem to be more brittle and sensitive to wording of counterstereotypical sentences than TP approaches. We conclude that neither token probability nor natural language inference is a "better" bias metric in all cases, and we recommend a combination of TP, NLI, and downstream bias evaluations to ensure comprehensive evaluation of language models. Content Warning: This paper contains examples of anti-LGBTQ+ stereotypes.
Problem

Research questions and friction points this paper is trying to address.

Evaluating social bias in language models using token probability metrics
Testing natural language inference as alternative bias evaluation method
Comparing performance between token probability and NLI bias metrics
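For concreteness, one common TP-style scheme (a minimal-pair scheme in the style of CrowS-Pairs, not necessarily this paper's exact protocol) reports the fraction of sentence pairs in which the model assigns higher probability to the stereotypical variant; 0.5 is the unbiased ideal. A minimal sketch, where `log_prob` is a hypothetical stand-in for a real LM scorer:

```python
# Hedged sketch of a TP-style bias metric: the fraction of minimal pairs
# where the model gives higher (pseudo-)log-likelihood to the stereotype.
# `log_prob` returns toy illustrative numbers, not real model output.

def log_prob(sentence):
    toy_scores = {
        "The doctor said he was busy.": -12.1,
        "The doctor said she was busy.": -12.8,
        "The nurse said she was busy.": -11.5,
        "The nurse said he was busy.": -11.6,
    }
    return toy_scores[sentence]

def tp_bias_score(pairs):
    """pairs: list of (stereotypical, counter-stereotypical) sentences.
    Returns the fraction where the model prefers the stereotype (ideal: 0.5)."""
    prefer_stereo = sum(
        1 for stereo, anti in pairs if log_prob(stereo) > log_prob(anti)
    )
    return prefer_stereo / len(pairs)

pairs = [
    ("The doctor said he was busy.", "The doctor said she was busy."),
    ("The nurse said she was busy.", "The nurse said he was busy."),
]
print(tp_bias_score(pairs))  # 1.0: the toy scorer prefers both stereotypes
```

The pair construction (changing only the identity term) is what makes the metric broadly applicable, and also what distances it from downstream use, as the abstract notes.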
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using natural language inference for bias evaluation
Comparing token probability with textual entailment metrics
Recommending combined bias evaluation approach for models
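An NLI-based metric, by contrast, typically asks whether an entailment model judges a stereotype hypothesis to follow from a premise naming a group; an unbiased model should mostly answer "neutral". The sketch below illustrates the idea with `nli_label` as a toy stand-in for a real NLI classifier (the labels and examples are invented, not the paper's data):

```python
# Hedged sketch of an NLI-style bias metric: the fraction of
# (premise, stereotype-hypothesis) pairs labeled "entailment".
# Lower is better; the unbiased ideal is 0.0. `nli_label` is a
# hypothetical stand-in, not a real entailment model.

def nli_label(premise, hypothesis):
    toy = {
        ("Alex is a nurse.", "Alex is a woman."): "neutral",
        ("Sam is an engineer.", "Sam is a man."): "entailment",
    }
    return toy[(premise, hypothesis)]

def nli_bias_rate(cases):
    """cases: list of (premise, stereotype hypothesis) pairs."""
    hits = sum(1 for p, h in cases if nli_label(p, h) == "entailment")
    return hits / len(cases)

cases = [
    ("Alex is a nurse.", "Alex is a woman."),
    ("Sam is an engineer.", "Sam is a man."),
]
print(nli_bias_rate(cases))  # 0.5: one of the two stereotypes is "entailed"
```

Because this score depends on the exact wording of premise and hypothesis, it surfaces semantic bias that TP misses, but it is also more brittle to counter-stereotypical phrasing, which is why the paper recommends running TP, NLI, and downstream evaluations together rather than relying on any single family.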