BiAxisAudit: A Novel Framework to Evaluate LLM Bias Across Prompt Sensitivity and Response-Layer Divergence

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current evaluations of bias in large language models often reduce bias to a single scalar metric. This overlooks how variations in prompt formatting change the bias a model exhibits, and it ignores inconsistencies between the stance a model selects and the stance it elaborates. This work proposes BiAxisAudit, a framework that assesses bias along two orthogonal dimensions: across prompts, via factorial designs that vary task format, perspective, role, and sentiment; and within responses, by applying Split Coding to separate selection and elaboration signals. Experiments on 641,600 coded responses across eight models (80,200 per model) show that task format contributes as much to bias variance as model differences do, and that 63.6% of pooled bias signals appear in only one response layer. Some prompt configurations reduce both Bias Exposure Rate (BER) and Inconsistency Rate (IR), whereas others suppress only selection-layer bias; the framework therefore separates genuine bias reductions from apparent ones caused by "cancellation traps" and cross-layer redistribution.
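As a rough illustration of the across-prompt axis, the sketch below enumerates a full factorial grid over the four prompt dimensions named in the summary. The factor levels (for example `multiple_choice` or `expert`) and the function name `prompt_grid` are hypothetical placeholders; the summary does not list the levels the authors actually use.

```python
from itertools import product

# Hypothetical factor levels; only the four factor names come from the summary.
FACTORS = {
    "task_format": ["multiple_choice", "agree_disagree"],
    "perspective": ["first_person", "third_person"],
    "role": ["none", "expert"],
    "sentiment": ["neutral", "negative"],
}


def prompt_grid(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    level_lists = [factors[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*level_lists)]


grid = prompt_grid(FACTORS)
# With two levels per factor, each statement is evaluated under 2**4 = 16 conditions.
print(len(grid))
```

Scoring every statement under each cell of such a grid yields a distribution of bias scores per statement rather than a single point estimate, which is the property the across-prompt axis relies on.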

📝 Abstract

Bias audits of large language models now operate within governance frameworks such as the EU AI Act, making benchmark reliability a security concern in its own right. Many current benchmarks, however, collapse bias into a single scalar from one prompt format and one surface label. This design misses two failure modes that can be exploited without changing model weights. Across prompts, meaning-preserving format changes shift bias endorsement by more than 0.7 on a fixed statement pool. Within a response, the discrete Selection and free-text Elaboration can take opposing stances, so an apparently clean aggregate may hide substantial internal inconsistency (a "cancellation trap"). Selection-only and elaboration-only rankings are therefore nearly uncorrelated across eight LLMs (Spearman ρ = 0.238, p = 0.570): LLaMA3-70B ranks in the middle under selection-only scoring but highest under elaboration-only scoring on the same responses. We introduce BiAxisAudit, a protocol that reports each bias score together with a reliability estimate on two orthogonal axes. The across-prompt axis evaluates each statement under a factorial grid of task format, perspective, role, and sentiment, treating bias as a distribution rather than a point estimate. The within-response axis uses Split Coding to recover Selection and Elaboration as separate signals, measured by the Inconsistency Rate and Divergence Net Imbalance. Across eight LLMs with 80,200 coded responses each, task format alone explains as much variance as model choice; 63.6% of pooled bias signals (up to 85.2% per model) appear in only one coding layer, and prompt-dimension interactions exceed main effects. The instrument also separates real bias reductions from apparent reductions caused by cross-layer redistribution: some prompt configurations reduce both BER and IR, whereas others suppress only selection-layer bias.
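The within-response axis can be sketched in a few lines of Python. The abstract does not give formulas, so the definitions here are assumptions: the inconsistency rate is taken as the share of responses whose Selection and Elaboration labels disagree, and the single-layer share as the fraction of flagged responses that are flagged in exactly one layer. The names `CodedResponse`, `inconsistency_rate`, and `single_layer_share` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodedResponse:
    selection_biased: bool    # assumed binary code for the discrete choice
    elaboration_biased: bool  # assumed binary code for the free-text part


def inconsistency_rate(responses):
    """Share of responses whose two layers disagree (assumed definition)."""
    if not responses:
        return 0.0
    disagree = sum(r.selection_biased != r.elaboration_biased for r in responses)
    return disagree / len(responses)


def single_layer_share(responses):
    """Among responses flagged in at least one layer, share flagged in exactly one."""
    flagged = [r for r in responses if r.selection_biased or r.elaboration_biased]
    if not flagged:
        return 0.0
    one_layer = sum(r.selection_biased != r.elaboration_biased for r in flagged)
    return one_layer / len(flagged)


# Toy data: one response per combination of the two layer codes.
sample = [
    CodedResponse(True, False),
    CodedResponse(False, True),
    CodedResponse(True, True),
    CodedResponse(False, False),
]
print(inconsistency_rate(sample))  # 2 of 4 responses disagree across layers
print(single_layer_share(sample))  # 2 of 3 flagged responses are single-layer
```

In this toy sample, selection-only and elaboration-only scoring each flag two responses, but not the same two, which is the kind of disagreement a single surface label would hide.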

Problem

Research questions and friction points this paper is trying to address.

LLM bias

prompt sensitivity

response-layer divergence

bias audit

inconsistency

Innovation

Methods, ideas, or system contributions that make the work stand out.

BiAxisAudit

prompt sensitivity

response-layer divergence