🤖 AI Summary
Current evaluations of bias in large language models often reduce bias to a single scalar metric. This overlooks how variations in prompt formatting change the way bias manifests, and it ignores inconsistencies between the stance a model selects and the stance it elaborates. This work proposes BiAxisAudit, a framework that assesses bias along two orthogonal axes: across prompts, via a factorial design that manipulates task format, perspective, role, and sentiment; and within responses, by applying Split Coding to separate Selection and Elaboration signals. Experiments on 641,600 coded responses (80,200 for each of eight models) show that task format explains as much bias variance as model choice and that 63.6% of pooled bias signals appear in only one coding layer. The framework also distinguishes prompt configurations that reduce both Bias Exposure Rate (BER) and Inconsistency Rate (IR) from those that suppress only selection-layer bias, exposing “cancellation traps” and spurious bias mitigation effects.
📝 Abstract
Bias audits of large language models now operate within governance frameworks such as the EU AI Act, making benchmark reliability a security concern in its own right. Many current benchmarks, however, collapse bias into a single scalar from one prompt format and one surface label. This design misses two failure modes that can be exploited without changing model weights. Across prompts, meaning-preserving format changes shift bias endorsement by more than $0.7$ on a fixed statement pool. Within a response, the discrete Selection and free-text Elaboration can take opposing stances, so an apparently clean aggregate may hide substantial internal inconsistency (a ``cancellation trap''). Selection-only and elaboration-only rankings are therefore nearly uncorrelated across eight LLMs (Spearman $\rho = 0.238$, $p = 0.570$): LLaMA3-70B ranks in the middle under selection-only scoring but highest under elaboration-only scoring on the same responses. We introduce \textsc{BiAxisAudit}, a protocol that reports each bias score together with a reliability estimate on two orthogonal axes. The across-prompt axis evaluates each statement under a factorial grid of task format, perspective, role, and sentiment, treating bias as a distribution rather than a point estimate. The within-response axis uses Split Coding to recover Selection and Elaboration as separate signals, measured by the Inconsistency Rate (IR) and Divergence Net Imbalance. Across eight LLMs with $80{,}200$ coded responses each, task format alone explains as much variance as model choice; $63.6\%$ of pooled bias signals (up to $85.2\%$ per model) appear in only one coding layer, and prompt-dimension interactions exceed main effects. The instrument also separates real bias reductions from apparent reductions caused by cross-layer redistribution: some prompt configurations reduce both the Bias Exposure Rate (BER) and IR, whereas others suppress only selection-layer bias.
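To make the two axes concrete, the following Python sketch enumerates a factorial prompt grid and computes layer-level rates from split-coded responses. It is an illustration under stated assumptions, not the authors' implementation. The factor levels, the `CodedResponse` type, and all function names are hypothetical. The formulas are also assumed, since the abstract does not define them: BER is taken here as the share of responses biased in at least one layer, and IR as the share whose Selection and Elaboration codes disagree.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical factor levels: the abstract names the four factors
# but does not list their levels.
FACTORS = {
    "task_format": ["multiple_choice", "agree_disagree"],
    "perspective": ["first_person", "third_person"],
    "role": ["none", "expert"],
    "sentiment": ["positive", "negative"],
}


def factorial_grid(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, combo))
            for combo in product(*(factors[n] for n in names))]


@dataclass(frozen=True)
class CodedResponse:
    """One response after Split Coding (binary codes are an assumption)."""
    selection_biased: bool     # stance of the discrete Selection
    elaboration_biased: bool   # stance of the free-text Elaboration


def _require_nonempty(responses):
    if not responses:
        raise ValueError("need at least one coded response")


def bias_exposure_rate(responses):
    """Assumed BER: share of responses biased in at least one layer."""
    _require_nonempty(responses)
    hits = sum(r.selection_biased or r.elaboration_biased for r in responses)
    return hits / len(responses)


def inconsistency_rate(responses):
    """Assumed IR: share of responses whose two layers disagree."""
    _require_nonempty(responses)
    hits = sum(r.selection_biased != r.elaboration_biased for r in responses)
    return hits / len(responses)


def single_layer_share(responses):
    """Among biased responses, the share biased in exactly one layer."""
    biased = [r for r in responses
              if r.selection_biased or r.elaboration_biased]
    if not biased:
        return 0.0
    return sum(r.selection_biased != r.elaboration_biased
               for r in biased) / len(biased)
```

Under these assumptions, each statement would be rendered once per entry of `factorial_grid(FACTORS)`, and the rates would be computed per grid cell, so that bias is reported as a distribution over prompt conditions instead of a single number. A selection-only audit would read only `selection_biased` and would miss any response in which bias appears solely in the elaboration.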