Moral Sensitivity in LLMs: A Tiered Evaluation of Contextual Bias via Behavioral Profiling and Mechanistic Interpretability

📅 2026-05-04

🤖 AI Summary

Current evaluations of bias in large language models predominantly rely on binary frameworks, which fail to capture the graded, context-sensitive nature of ethical bias in complex social settings. This work proposes a two-stage evaluation framework. First, it introduces a seven-tier progressive stress test coupled with a Moral Sensitivity Index (MSI) that quantifies the probability of biased output across contexts ranging from abstract numerical problems to scenarios rooted in historical and socioeconomic injustice. Second, it applies mechanistic interpretability techniques (logit lens analysis, attention analysis, activation patching, and semantic probing) to trace the observed biases to their circuit-level origins. The study reports cross-stage alignment between behavioral outcomes and underlying mechanisms, and finds that bias follows a U-shaped curve across capability tiers: small language models exhibit strong criminal bias, instruction-tuned models eliminate it, and reasoning distillation reintroduces it to SLM-like levels. Empirically, Gemini 1.5 reaches an MSI of 72.7% by Tier 5 under socioeconomic framing, whereas Claude exhibits sharp suppression consistent with identity-based safety training.

📝 Abstract

Large language models (LLMs) are increasingly deployed in settings that require nuanced ethical reasoning, yet existing bias evaluations treat model outputs as simply "biased" or "unbiased." This binary framing misses the gradual, context-sensitive way bias actually emerges. We address this gap in two stages: behavioral profiling and mechanistic validation. In the behavioral stage, we introduce the Moral Sensitivity Index (MSI), a metric that quantifies the probability of biased output across a graduated, seven-tier stress test ranging from abstract numerical problems to scenarios rooted in historical and socioeconomic injustice. Evaluating four leading models (Claude 3.5, Qwen 3.5, Llama 3, and Gemini 1.5), we identify distinct behavioral signatures shaped by alignment design: for instance, Gemini 1.5 reaches 72.7% MSI by Tier 5 under socioeconomic framing, while Claude exhibits sharp suppression consistent with identity-based safety training. We then verify these behavioral patterns mechanistically. We select criminal-bias scenarios, which produced the highest MSI scores across models, as probes and apply logit lens, attention analysis, activation patching, and semantic probing to a controlled set of six models spanning three capability tiers: small language models (SLMs), instruction-tuned base models, and reasoning-distilled variants. Circuit-level analysis reveals a U-curve of bias: SLMs exhibit strong criminal bias; scaling to instruction-tuned models eliminates it; reasoning distillation reintroduces bias to SLM-like levels despite identical parameter counts, suggesting distillation compresses reasoning traces in ways that reactivate shallow statistical associations. Critically, the socially loaded cues that drive high MSI scores activate the same bias-driving circuits identified mechanistically, providing cross-stage validation.
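The abstract describes the MSI as the probability of biased output at each tier of the stress test, but does not give the exact formula. As a minimal sketch, assuming the MSI for a tier is simply the fraction of responses at that tier that a judge labeled as biased, the per-tier scores could be computed as follows. The function name `moral_sensitivity_index`, the `(tier, is_biased)` record layout, and the sample labels are all hypothetical and not taken from the paper.

```python
from collections import defaultdict


def moral_sensitivity_index(records):
    """Return {tier: fraction of responses labeled biased}.

    ASSUMPTION: MSI per tier is the empirical probability of a biased
    output. The paper's actual definition may weight or aggregate
    differently.
    """
    counts = defaultdict(lambda: [0, 0])  # tier -> [biased, total]
    for tier, is_biased in records:
        counts[tier][0] += int(is_biased)
        counts[tier][1] += 1
    return {
        tier: biased / total
        for tier, (biased, total) in sorted(counts.items())
    }


# Hypothetical judge labels: (tier number, was the response biased?)
records = [
    (1, False), (1, False),
    (5, True), (5, True), (5, False), (5, True),
]

msi = moral_sensitivity_index(records)
for tier, score in msi.items():
    print(f"Tier {tier}: MSI = {score:.1%}")
```

Under this reading, a figure such as 72.7% at Tier 5 would mean that roughly 73 in 100 responses to Tier 5 prompts were judged biased.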

Problem

Research questions and friction points this paper is trying to address.

moral sensitivity

contextual bias

large language models

ethical reasoning

bias evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Moral Sensitivity Index

contextual bias

mechanistic interpretability