🤖 AI Summary
This study investigates whether reasoning language models (RLMs)—designed to enhance multi-step reasoning—exhibit improved robustness against social biases. We systematically evaluate the impact of chain-of-thought prompting and reasoning-path fine-tuning on fairness using the CLEAR-Bias benchmark, LLM-as-a-judge automated evaluation, jailbreaking attacks, and multidimensional cultural sensitivity analysis. Contrary to the “reasoning implies safety” hypothesis, our results demonstrate that explicit reasoning mechanisms exacerbate bias exposure, particularly under narrative-style adversarial attacks: RLMs are more susceptible than base models to generating biased outputs when prompted with socially sensitive scenarios. This work provides the first empirical evidence of a positive correlation between reasoning capability and bias vulnerability. Its core contribution is the identification of this counterintuitive relationship and the proposal of a novel design paradigm—bias-aware reasoning architectures—that explicitly integrate bias perception into the reasoning process to mitigate fairness risks.
📝 Abstract
Reasoning Language Models (RLMs) have gained traction for their ability to perform complex, multi-step reasoning tasks through mechanisms such as Chain-of-Thought (CoT) prompting or fine-tuned reasoning traces. While these capabilities promise improved reliability, their impact on robustness to social biases remains unclear. In this work, we leverage the CLEAR-Bias benchmark, originally designed for Large Language Models (LLMs), to investigate the adversarial robustness of RLMs to bias elicitation. We systematically evaluate state-of-the-art RLMs across diverse sociocultural dimensions, using an LLM-as-a-judge approach for automated safety scoring and leveraging jailbreak techniques to assess the strength of built-in safety mechanisms. Our evaluation addresses three key questions: (i) how the introduction of reasoning capabilities affects model fairness and robustness; (ii) whether models fine-tuned for reasoning exhibit greater safety than those relying on CoT prompting at inference time; and (iii) how the success rate of jailbreak attacks targeting bias elicitation varies with the reasoning mechanisms employed. Our findings reveal a nuanced relationship between reasoning capabilities and bias safety. Surprisingly, models with explicit reasoning, whether via CoT prompting or fine-tuned reasoning traces, are generally more vulnerable to bias elicitation than base models without such mechanisms, suggesting reasoning may unintentionally open new pathways for stereotype reinforcement. Reasoning-enabled models appear somewhat safer than those relying on CoT prompting, which are particularly prone to contextual reframing attacks through storytelling prompts, fictional personas, or reward-shaped instructions. These results challenge the assumption that reasoning inherently improves robustness and underscore the need for more bias-aware approaches to reasoning design.