Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models

📅 2025-07-03

📈 Citations: 0

✨ Influential: 0

career value

192K/year

🤖 AI Summary

This study investigates whether reasoning language models (RLMs)—designed to enhance multi-step reasoning—exhibit improved robustness against social biases. We systematically evaluate the impact of chain-of-thought prompting and reasoning-path fine-tuning on fairness using the CLEAR-Bias benchmark, LLM-as-a-judge automated evaluation, jailbreaking attacks, and multidimensional cultural sensitivity analysis. Contrary to the “reasoning implies safety” hypothesis, our results demonstrate that explicit reasoning mechanisms exacerbate bias exposure, particularly under narrative-style adversarial attacks: RLMs are more susceptible than base models to generating biased outputs when prompted with socially sensitive scenarios. This work provides the first empirical evidence of a positive correlation between reasoning capability and bias vulnerability. Its core contribution is the identification of this counterintuitive relationship and the proposal of a novel design paradigm—bias-aware reasoning architectures—that explicitly integrate bias perception into the reasoning process to mitigate fairness risks.

Technology Category

Application Category

📝 Abstract

Reasoning Language Models (RLMs) have gained traction for their ability to perform complex, multi-step reasoning tasks through mechanisms such as Chain-of-Thought (CoT) prompting or fine-tuned reasoning traces. While these capabilities promise improved reliability, their impact on robustness to social biases remains unclear. In this work, we leverage the CLEAR-Bias benchmark, originally designed for Large Language Models (LLMs), to investigate the adversarial robustness of RLMs to bias elicitation. We systematically evaluate state-of-the-art RLMs across diverse sociocultural dimensions, using an LLM-as-a-judge approach for automated safety scoring and leveraging jailbreak techniques to assess the strength of built-in safety mechanisms. Our evaluation addresses three key questions: (i) how the introduction of reasoning capabilities affects model fairness and robustness; (ii) whether models fine-tuned for reasoning exhibit greater safety than those relying on CoT prompting at inference time; and (iii) how the success rate of jailbreak attacks targeting bias elicitation varies with the reasoning mechanisms employed. Our findings reveal a nuanced relationship between reasoning capabilities and bias safety. Surprisingly, models with explicit reasoning, whether via CoT prompting or fine-tuned reasoning traces, are generally more vulnerable to bias elicitation than base models without such mechanisms, suggesting reasoning may unintentionally open new pathways for stereotype reinforcement. Reasoning-enabled models appear somewhat safer than those relying on CoT prompting, which are particularly prone to contextual reframing attacks through storytelling prompts, fictional personas, or reward-shaped instructions. These results challenge the assumption that reasoning inherently improves robustness and underscore the need for more bias-aware approaches to reasoning design.

Problem

Research questions and friction points this paper is trying to address.

Investigates how reasoning capabilities affect model fairness and robustness

Examines if reasoning fine-tuned models are safer than CoT-prompted ones

Assesses jailbreak attack success rates across different reasoning mechanisms

Innovation

Methods, ideas, or system contributions that make the work stand out.

Using CLEAR-Bias benchmark for bias evaluation

Employing LLM-as-a-judge for automated safety scoring

Leveraging jailbreak techniques to test safety mechanisms

🔎 Similar Papers

No similar papers found.