🤖 AI Summary
Retrieval-augmented generation (RAG) systems, while enhancing large language model (LLM) performance, may inadvertently amplify societal biases—such as gender or racial stereotypes—when their retrieval modules are poisoned; this risk has so far remained uncharacterized. Method: This work establishes, for the first time, a causal link between RAG poisoning and bias amplification, proposing BRRA, the first attack framework targeting bias reinforcement. BRRA combines multi-objective reward-driven adversarial document generation, manipulation of the retrieval embedding space via subspace projection, and a closed-loop generate-retrieve-rerank feedback mechanism that persistently intensifies bias. Contribution/Results: Experiments show BRRA increases bias metrics by 42.7% on average across mainstream LLMs. A two-stage defense—comprising retrieval purification and generation calibration—reduces bias amplification below baseline levels, demonstrating a fundamental interplay between RAG security and model fairness.
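To make the first component concrete, here is a minimal sketch of what a multi-objective reward for scoring candidate adversarial documents could look like. All scorer functions, the stereotype-term set, and the weights are hypothetical placeholders for illustration; the paper's actual objectives and weighting scheme are not reproduced here (a real attack would, for instance, use LM perplexity rather than word count as a fluency proxy).

```python
def fluency_score(doc: str) -> float:
    # Placeholder fluency proxy: longer documents score higher, capped at 1.0.
    # A real attack would use an LM-based score (e.g., inverse perplexity).
    return min(1.0, len(doc.split()) / 50.0)

def relevance_score(doc: str, query: str) -> float:
    # Placeholder relevance proxy: lexical overlap with the query stands in
    # for embedding similarity.
    doc_terms = set(doc.lower().split())
    query_terms = set(query.lower().split())
    return len(doc_terms & query_terms) / max(1, len(query_terms))

def bias_alignment_score(doc: str, stereotype_terms: set) -> float:
    # Placeholder bias objective: fraction of stereotype-associated terms
    # that appear in the document.
    doc_terms = set(doc.lower().split())
    return len(doc_terms & stereotype_terms) / max(1, len(stereotype_terms))

def reward(doc: str, query: str, stereotype_terms: set,
           w_fluency: float = 0.2, w_relevance: float = 0.4,
           w_bias: float = 0.4) -> float:
    # Weighted combination of the three objectives; the weights are
    # illustrative assumptions, not values from the paper.
    return (w_fluency * fluency_score(doc)
            + w_relevance * relevance_score(doc, query)
            + w_bias * bias_alignment_score(doc, stereotype_terms))
```

In a reward-driven generation loop, candidate documents would be sampled and the highest-reward ones kept, trading off plausibility (fluency, relevance) against the attacker's bias objective.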
📝 Abstract
Retrieval-Augmented Generation (RAG) can significantly enhance the performance of large language models (LLMs) by integrating external knowledge, but it also introduces new security risks. Existing research focuses mainly on how poisoning attacks in RAG systems degrade model output quality, overlooking their potential to amplify model biases. For example, when queried about domestic violence victims, a compromised RAG system might preferentially retrieve documents depicting women as victims, causing the model to generate outputs that perpetuate gender stereotypes even when the original query is gender-neutral. To characterize this impact, this paper proposes the Bias Retrieval and Reward Attack (BRRA) framework, which systematically investigates attack pathways that amplify language model biases through RAG system manipulation. We design an adversarial document generation method based on multi-objective reward functions, employ subspace projection techniques to manipulate retrieval results, and construct a cyclic feedback mechanism for continuous bias amplification. Experiments on multiple mainstream large language models demonstrate that BRRA attacks significantly amplify model biases across multiple dimensions. In addition, we explore a dual-stage defense mechanism that effectively mitigates the impact of the attack. This study shows that poisoning attacks in RAG systems directly amplify model output biases and clarifies the relationship between RAG system security and model fairness. This novel attack surface indicates that fairness must be monitored as part of RAG system security.
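The retrieval-manipulation step can be illustrated with a simplified sketch: nudging a poisoned document's embedding toward the query direction so it ranks higher under cosine similarity. This one-dimensional blend is an illustrative assumption standing in for the paper's subspace projection technique; the blending coefficient `alpha` and the function names are hypothetical.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Standard cosine similarity between two vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def shift_toward_query(doc_emb: np.ndarray, query_emb: np.ndarray,
                       alpha: float = 0.6) -> np.ndarray:
    """Blend a document embedding toward the query direction (illustrative).

    The result keeps the document's magnitude but moves its direction
    closer to the query's, raising its cosine-similarity rank.
    """
    q_unit = query_emb / np.linalg.norm(query_emb)
    target = q_unit * np.linalg.norm(doc_emb)  # query direction, same norm
    return (1 - alpha) * doc_emb + alpha * target

# Toy demonstration with random embeddings in place of a real encoder.
rng = np.random.default_rng(0)
query = rng.normal(size=8)
adv_doc = rng.normal(size=8)
shifted = shift_toward_query(adv_doc, query)
```

After the shift, `cosine(shifted, query)` exceeds `cosine(adv_doc, query)`, so a top-k retriever scoring by cosine similarity becomes more likely to surface the poisoned document; the cyclic feedback mechanism would then reinforce this effect across retrieval rounds.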