When Names Change Verdicts: Intervention Consistency Reveals Systematic Bias in LLM Decision-Making

📅 2026-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the susceptibility of large language models (LLMs) to systematic biases in high-stakes decision-making, where irrelevant features such as names, authority titles, and framing effects unduly influence outcomes. The authors propose ICE-Guard, a novel framework that systematically quantifies three bias categories—demographic, authority-based, and framing-induced—and reveals that the latter two significantly outweigh demographic bias. By integrating intervention consistency testing, a structured disentangled decision mechanism, and an iterative prompting repair loop, ICE-Guard substantially reduces model reliance on spurious cues. Evaluated across 11 mainstream LLMs, the approach achieves a median bias reduction of 49% (up to 100%), with cumulative reductions reaching 78% through ICE-guided refinement. Notably, authority bias in financial contexts reaches 22.6%, highlighting domain-specific disparities. Experiments on both synthetic and real-world data—including the COMPAS recidivism dataset—demonstrate the method’s robust effectiveness.
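The intervention consistency test behind these numbers can be pictured as a paired-query probe: each vignette is re-issued with exactly one spurious feature swapped, and a verdict change counts as a flip. Below is a minimal Python sketch of that flip-rate measurement; `query_llm`, the loan vignette, and the credential pair are illustrative assumptions (here `query_llm` returns a canned answer so the sketch runs end to end), not the authors' released code.

```python
# Minimal sketch of intervention consistency testing (illustrative only).
# `query_llm`, the vignette, and the swapped credentials are hypothetical.

def query_llm(prompt: str) -> str:
    """Stand-in for a real chat-completion call; swap in your own client.
    Returns a canned verdict here so the sketch runs end to end."""
    return "approve" if "economist" in prompt else "deny"

def flip_rate(vignette_templates, spurious_pair):
    """Fraction of vignettes whose verdict flips when only one spurious
    feature (a name, a credential, or the framing) is swapped."""
    flips = 0
    for template in vignette_templates:
        baseline, intervened = spurious_pair
        verdict_a = query_llm(template.format(feature=baseline))
        verdict_b = query_llm(template.format(feature=intervened))
        flips += int(verdict_a != verdict_b)
    return flips / len(vignette_templates)

# Hypothetical authority-bias probe: only the credential changes.
loan_vignette = (
    "A {feature} applies for a small-business loan with a 640 credit score "
    "and stable income. Should the loan be approved? Answer approve or deny."
)
rate = flip_rate([loan_vignette], ("Harvard-educated economist", "self-taught bookkeeper"))
print(f"authority flip rate: {rate:.1%}")
```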

📝 Abstract
Large language models (LLMs) are increasingly used for high-stakes decisions, yet their susceptibility to spurious features remains poorly characterized. We introduce ICE-Guard, a framework that applies intervention consistency testing to detect three types of spurious feature reliance: demographic (name/race swaps), authority (credential/prestige swaps), and framing (positive/negative restatements). Across 3,000 vignettes spanning 10 high-stakes domains, we evaluate 11 LLMs from 8 families and find that (1) authority bias (mean 5.8%) and framing bias (5.0%) substantially exceed demographic bias (2.2%), challenging the field's narrow focus on demographics; (2) bias concentrates in specific domains: finance shows 22.6% authority bias while criminal justice shows only 2.8%; (3) structured decomposition, where the LLM extracts features and a deterministic rubric decides, reduces flip rates by up to 100% (median 49% across 9 models). We demonstrate an ICE-guided detect-diagnose-mitigate-verify loop achieving a cumulative 78% bias reduction via iterative prompt patching. Validation against real COMPAS recidivism data shows that COMPAS-derived flip rates exceed pooled synthetic rates, suggesting our benchmark provides a conservative estimate of real-world bias. Code and data are publicly available.
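The structured decomposition described in the abstract can be illustrated by splitting the decision into two stages: the LLM only fills in task-relevant fields, and a fixed deterministic rubric maps those fields to a verdict, so a swapped name or credential cannot change the outcome unless it changes an extracted feature. The sketch below is a hedged illustration under assumed field names and thresholds (`credit_score`, `debt_to_income`, the 620 cutoff); `extract_features` is a placeholder for an LLM call and is not the paper's actual rubric.

```python
# Illustrative sketch of structured decomposition: the LLM extracts
# decision-relevant features, and a deterministic rubric decides.
# Field names, threshold values, and the extraction step are
# hypothetical placeholders, not the rubric used in the paper.

def extract_features(vignette: str) -> dict:
    """Placeholder for an LLM call that returns ONLY the fields the
    rubric needs (e.g. as JSON), discarding names, titles, and framing."""
    raise NotImplementedError("plug in your LLM client here")

def rubric_decision(features: dict) -> str:
    """Deterministic rule: the verdict depends only on extracted features,
    so spurious cues in the original text cannot flip it."""
    if features["credit_score"] >= 620 and features["debt_to_income"] <= 0.40:
        return "approve"
    return "deny"

def decide(vignette: str) -> str:
    return rubric_decision(extract_features(vignette))

# The rubric is indifferent to anything outside the extracted fields:
print(rubric_decision({"credit_score": 660, "debt_to_income": 0.31}))  # approve
```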
Problem

Research questions and friction points this paper is trying to address.

spurious features
systematic bias
large language models
high-stakes decisions
intervention consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

intervention consistency
spurious feature reliance
structured decomposition
bias mitigation
LLM decision-making
Abhinaba Basu
Indian Institute of Information Technology, Allahabad (IIITA); National Institute of Electronics and Information Technology (NIELIT)
Pavan Chakraborty
Indian Institute of Information Technology Allahabad
Artificial Intelligence
Robotics & Instrumentation