Modeling Implicit Conflict Monitoring Mechanisms against Stereotypes in LLMs

📅 2026-05-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Implicit biases in large language models are difficult to mitigate effectively through conventional safety mechanisms. Inspired by conflict monitoring and response inhibition theories from cognitive neuroscience, this work proposes COCO, the first approach to incorporate a conflict-monitoring mechanism into debiasing research for large language models. Using contrastive causal analysis, COCO identifies critical neurons that exhibit both high cohesion and strong contrast during counterfactual generation. Building on this, the study further introduces LE-COCO and NE-COCO, two lightweight, training-free enhancement strategies. Experiments show that ablating COCO-identified neurons causes over 90% of model outputs to revert to biased responses, whereas the proposed methods significantly improve fairness and robustness against jailbreaking attacks on open-domain safety benchmarks, all while preserving the model’s original generative capabilities.

📝 Abstract

In this paper, we study emergent self-debiasing mechanisms against stereotypical content in Large Language Models (LLMs). Unlike traditional safety mechanisms that are primarily triggered by explicit input-level stimuli, self-debiasing mechanisms can involve generation-time intrinsic corrections that are not directly reducible to surface-level prompts. Motivated by conflict-monitoring and response-inhibition accounts in cognitive neuroscience, we propose COCO, a contrastive causal method designed to identify COCO neurons that exhibit high intra-COnsistency yet sharp inter-COntrast across antithetical generative responses, such as stereotypical versus unbiased outputs. Ablation studies reveal that deactivating COCO neurons leads to a catastrophic collapse of the model's fairness: over 90% of outputs revert to biased content, far exceeding the bias levels induced by explicit adversarial jailbreak attacks. Observing that simple weight amplification of COCO neurons yields only marginal gains, we propose two training-free, lightweight editing strategies: Local Enhancement (LE-COCO) and Networked Enhancement (NE-COCO). Comprehensive evaluations show that our methods bolster robustness against adversarial jailbreaks and achieve strong performance on open-ended safety benchmarks, while preserving foundational generative proficiency. While this study primarily addresses social stereotypes, the COCO mechanism holds significant potential for diverse domains like hallucination detection, offering valuable insights toward the development of self-evolving AI agents.
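
The selection idea in the abstract (neurons that are consistent within each response type yet sharply different between the two) can be illustrated with a toy sketch. The abstract does not give the scoring formula, so everything below is an assumption: the Fisher-style ratio stands in for the paper's contrastive causal analysis, the activations are synthetic, and the names `coco_style_scores`, `top_k_neurons`, and `ablate` are hypothetical.

```python
import numpy as np


def coco_style_scores(acts_a, acts_b, eps=1e-8):
    """Score neurons by between-group contrast over within-group spread.

    acts_a, acts_b: arrays of shape (n_samples, n_neurons) with activations
    recorded for the two antithetical response types.
    NOTE: this Fisher-style ratio is an illustrative assumption,
    not the scoring rule used in the paper.
    """
    mean_a, mean_b = acts_a.mean(axis=0), acts_b.mean(axis=0)
    contrast = (mean_a - mean_b) ** 2                 # inter-group contrast
    spread = acts_a.var(axis=0) + acts_b.var(axis=0)  # low spread = high consistency
    return contrast / (spread + eps)


def top_k_neurons(scores, k):
    """Return the indices of the k highest-scoring neurons."""
    return np.argsort(scores)[::-1][:k]


def ablate(acts, neuron_idx):
    """Return a copy of the activations with the selected neurons zeroed."""
    out = acts.copy()
    out[:, neuron_idx] = 0.0
    return out


# Synthetic activations: 16 neurons, of which only 3 and 7 separate the groups.
rng = np.random.default_rng(0)
n_samples, n_neurons = 200, 16
biased = rng.normal(0.0, 1.0, size=(n_samples, n_neurons))
unbiased = rng.normal(0.0, 1.0, size=(n_samples, n_neurons))
unbiased[:, [3, 7]] += 5.0

scores = coco_style_scores(biased, unbiased)
selected = top_k_neurons(scores, k=2)
ablated = ablate(unbiased, selected)
```

In a real model the activations would come from recorded hidden states rather than random draws, and ablation would be applied during generation rather than to a stored array; this sketch only shows the shape of the selection step.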

Problem

Research questions and friction points this paper is trying to address.

implicit conflict monitoring

self-debiasing

stereotypes

large language models

fairness

Innovation

Methods, ideas, or system contributions that make the work stand out.

conflict monitoring

self-debiasing

COCO neurons