🤖 AI Summary
This work addresses systemic risks in large language models (LLMs) concerning safety, bias, and ideological alignment. We propose a two-stage prompt-engineering-based self-reflection framework designed to mitigate these risks. Through systematic experiments across multiple models, tasks, and prompts, we quantitatively evaluate performance along three dimensions: toxicity, gender bias, and partisan leaning. Our results demonstrate, for the first time, that self-reflection significantly enhances both safety and fairness: toxic responses decrease by 75.8% (with 97.8% retention of non-toxic outputs), gender bias declines by 77% (with 94.3% retention of unbiased outputs), and partisan leaning is fully eliminated (with 87.7% retention of neutral responses). Crucially, we identify strong dependence on prompt design and delineate the precise efficacy boundaries of self-reflection for safety and fairness—challenging prior assumptions that focused exclusively on its benefits for reasoning improvement.
📝 Abstract
Previous studies proposed that the reasoning capabilities of large language models (LLMs) can be improved through self-reflection, i.e., letting LLMs reflect on their own output to identify and correct mistakes in their initial responses. However, earlier experiments offer mixed results regarding the benefits of self-reflection. Furthermore, prior studies on self-reflection are predominantly concerned with the reasoning capabilities of models, ignoring the potential of self-reflection for safety, bias, and ideological leaning. Here, by conducting a series of experiments testing LLMs' self-reflection capabilities across a variety of tasks, prompts, and models, we make several contributions to the literature. First, we reconcile conflicting findings regarding the benefit of self-reflection by demonstrating that its outcome is sensitive to prompt wording -- both the original prompt used to elicit an initial answer and the subsequent prompt used to self-reflect. Specifically, although self-reflection may improve the reasoning capability of LLMs when the initial response is simple, the technique cannot improve upon state-of-the-art chain-of-thought (CoT) prompting. Second, we show that self-reflection can lead to safer (75.8% reduction in toxic responses while preserving 97.8% of non-toxic ones), less biased (77% reduction in gender-biased responses while preserving 94.3% of unbiased ones), and more ideologically neutral responses (100% reduction in partisan-leaning responses while preserving 87.7% of non-partisan ones). The paper concludes by discussing the implications of our findings for the deployment of large language models. We release our experiments at https://github.com/Michael98Liu/self-reflection.
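The two-stage procedure the abstract describes -- elicit an initial answer, then prompt the model to reflect on and revise it -- can be sketched as follows. This is a minimal illustration, not the paper's implementation: `query_model` is a hypothetical stand-in for a real LLM API call, and the reflection prompt wording is an assumption (the paper's key finding is precisely that results are sensitive to this wording).

```python
def query_model(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real API client in practice.

    Stubbed here with canned responses so the sketch is self-contained.
    """
    if "Review your previous answer" in prompt:
        return "On reflection, a neutral answer is more appropriate."
    return "Initial answer."


def self_reflect(question: str) -> str:
    # Stage 1: elicit an initial response with the original prompt.
    initial = query_model(question)

    # Stage 2: feed the initial response back and ask the model to
    # reflect on it, correcting mistakes, bias, or unsafe content.
    reflection_prompt = (
        f"Question: {question}\n"
        f"Your previous answer: {initial}\n"
        "Review your previous answer. If it contains mistakes, bias, or "
        "unsafe content, provide a corrected answer; otherwise repeat it."
    )
    return query_model(reflection_prompt)


print(self_reflect("Which political party is better?"))
```

Note that both the original prompt and the reflection prompt are free parameters here; per the abstract, small changes to either can determine whether self-reflection helps or not.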