🤖 AI Summary
This study systematically evaluates the effectiveness and limitations of Reinforcement Learning from Human Feedback (RLHF) in mitigating implicit and explicit biases against African Americans in large language models (LLMs). Method: We apply RLHF variants (DPO, ORPO, and RLOO) to base models including Llama 3 8B and evaluate the resulting models along two dimensions: matched-guise probing for implicit bias and explicit bias benchmarks. Contribution/Results: We identify, for the first time, that supervised fine-tuning (SFT) before RLHF exacerbates rather than alleviates implicit bias. We also propose a transferable multimodal bias evaluation framework and empirically validate its cross-modal applicability. Our results indicate that current RLHF methods fail to eliminate implicit bias, underscoring the need for bias-aware datasets, fine-grained data curation techniques, and task-adaptive alignment mechanisms.
📝 Abstract
Reinforcement Learning from Human Feedback (RLHF) is increasingly used to align large language models (LLMs) with human preferences. However, its effectiveness in addressing underlying biases remains unclear. This study investigates the relationship between RLHF and both covert and overt biases in LLMs, focusing in particular on biases against African Americans. We applied several RLHF techniques (DPO, ORPO, and RLOO) to Llama 3 8B and evaluated the covert and overt biases of the resulting models using matched-guise probing and explicit bias testing. We performed additional tests with DPO on different base models and datasets; among other findings, we observed that SFT before RLHF calcifies model biases. We also extend bias-measurement tools to multi-modal models. Our experiments provide evidence that current alignment techniques are inadequate for nebulous tasks such as mitigating covert biases, highlighting the need for better datasets, data curation techniques, and alignment tools.
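The matched-guise probing used for covert-bias evaluation can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: the prompt template, the dialect pair, the trait words, and the `toy_logprob` scorer are all hypothetical stand-ins. A real evaluation would score the prompts with the LLM under test (e.g. Llama 3 8B) and compare the log-probabilities it assigns to trait words across dialect guises.

```python
def guise_bias(logprob, trait, aae_text, sae_text):
    """Covert-bias score for one trait: how much more strongly the scorer
    associates `trait` with the AAE guise than with the matched SAE guise.
    Positive values indicate a stronger association with the AAE guise."""
    prompt = "A person who says '{}' is {}."
    return logprob(prompt.format(aae_text, trait)) - logprob(prompt.format(sae_text, trait))

def toy_logprob(text):
    # Hypothetical stand-in scorer; real code would return the model's
    # log-probability for the completed prompt.
    return -0.01 * len(text)

# Meaning-matched utterance pair differing only in dialect (guise).
pairs = [("I be done seen it", "I have seen it")]
traits = ["lazy", "brilliant"]

scores = {
    t: sum(guise_bias(toy_logprob, t, aae, sae) for aae, sae in pairs)
    for t in traits
}
```

Aggregating such scores over many utterance pairs and trait words yields a covert-bias profile that can be compared before and after each alignment step (SFT, DPO, ORPO, RLOO).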