🤖 AI Summary
This study systematically evaluates the effectiveness and limitations of Reinforcement Learning from Human Feedback (RLHF) in mitigating implicit and explicit biases against African Americans in large language models (LLMs). Method: We apply RLHF variants (DPO, ORPO, and RLOO) to base models including Llama 3 8B and evaluate the resulting models along two dimensions: matched-guise probing for implicit bias and explicit bias benchmarks. Contribution/Results: We identify, for the first time, that supervised fine-tuning (SFT) before RLHF exacerbates rather than alleviates implicit bias. We also propose a transferable multimodal bias evaluation framework and empirically validate its cross-modal applicability. Our results indicate that current RLHF methods fail to eliminate implicit bias, underscoring the need for bias-aware datasets, fine-grained data curation techniques, and task-adaptive alignment mechanisms.
📝 Abstract
Reinforcement Learning from Human Feedback (RLHF) is increasingly used to align large language models (LLMs) with human preferences. However, its effectiveness in addressing underlying biases remains unclear. This study investigates the relationship between RLHF and both covert and overt biases in LLMs, focusing in particular on biases against African Americans. We applied several RLHF techniques (DPO, ORPO, and RLOO) to Llama 3 8B and evaluated the covert and overt biases of the resulting models using matched-guise probing and explicit bias testing. We performed additional tests with DPO on different base models and datasets; among other findings, we observed that SFT before RLHF calcifies model biases. We also extend bias-measurement tools to multi-modal models. Our experiments provide evidence that current alignment techniques are inadequate for nebulous tasks such as mitigating covert biases, highlighting the need for better datasets, data curation techniques, and alignment tools.
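The matched-guise probing used for covert-bias evaluation can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: the prompt template, the dialect pair, the trait words, and the `toy_logprob` scorer are all hypothetical stand-ins. A real evaluation would score the prompts with the LLM under test (e.g. Llama 3 8B) and compare the log-probabilities it assigns to trait words across dialect guises.

```python
def guise_bias(logprob, trait, aae_text, sae_text):
    """Covert-bias score for one trait: how much more strongly the scorer
    associates `trait` with the AAE guise than with the matched SAE guise.
    Positive values indicate a stronger association with the AAE guise."""
    prompt = "A person who says '{}' is {}."
    return logprob(prompt.format(aae_text, trait)) - logprob(prompt.format(sae_text, trait))

def toy_logprob(text):
    # Hypothetical stand-in scorer; real code would return the model's
    # log-probability for the completed prompt.
    return -0.01 * len(text)

# Meaning-matched utterance pair differing only in dialect (guise).
pairs = [("I be done seen it", "I have seen it")]
traits = ["lazy", "brilliant"]

scores = {
    t: sum(guise_bias(toy_logprob, t, aae, sae) for aae, sae in pairs)
    for t in traits
}
```

Aggregating such scores over many utterance pairs and trait words yields a covert-bias profile that can be compared before and after each alignment step (SFT, DPO, ORPO, RLOO).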