Mitigating Selection Bias in Large Language Models via Permutation-Aware GRPO

📅 2026-03-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the susceptibility of large language models to non-semantic factors—such as option ordering and label symbols—in multiple-choice and pairwise evaluation tasks, which often leads to selection bias. To mitigate this issue, the authors propose Permutation-Aware Group Relative Policy Optimization (PA-GRPO), a novel reinforcement learning approach that, for the first time, incorporates permutation invariance into policy training. By constructing input permutation groups and introducing cross-permutation advantage estimation alongside a consistency-aware reward mechanism, PA-GRPO encourages the model to produce semantically consistent outputs under permutation perturbations—without requiring any intervention during inference. Experimental results across seven benchmarks demonstrate that PA-GRPO significantly outperforms strong baselines, substantially reducing selection bias while maintaining or even improving overall performance.

📝 Abstract
Large language models (LLMs) used for multiple-choice and pairwise evaluation tasks often exhibit selection bias due to non-semantic factors like option positions and label symbols. Existing inference-time debiasing is costly and may harm reasoning, while pointwise training ignores that the same question should yield consistent answers across permutations. To address this issue, we propose Permutation-Aware Group Relative Policy Optimization (PA-GRPO), which mitigates selection bias by enforcing permutation-consistent semantic reasoning. PA-GRPO constructs a permutation group for each instance by generating multiple candidate permutations, and optimizes the model using two complementary mechanisms: (1) cross-permutation advantage, which computes advantages relative to the mean reward over all permutations of the same instance, and (2) consistency-aware reward, which encourages the model to produce consistent decisions across different permutations. Experimental results demonstrate that PA-GRPO outperforms strong baselines across seven benchmarks, substantially reducing selection bias while maintaining high overall performance. The code will be made available on GitHub (https://github.com/ECNU-Text-Computing/PA-GRPO).
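The two mechanisms in the abstract can be sketched roughly as follows. This is an illustrative sketch only: the function names, data shapes, and the exact reward form are assumptions for exposition, not the paper's implementation.

```python
import statistics
from collections import Counter

def cross_permutation_advantages(rewards_by_perm):
    """Cross-permutation advantage (illustrative sketch).

    Instead of normalizing rewards within a single prompt's rollout
    group, pool the rewards over ALL permutations of the same
    instance, so every permutation is scored against a shared baseline.

    rewards_by_perm: {permutation_id: [reward per sampled rollout]}
    Returns the same structure with standardized advantages.
    """
    pooled = [r for rs in rewards_by_perm.values() for r in rs]
    mean = statistics.mean(pooled)
    std = statistics.pstdev(pooled) or 1.0  # guard against zero variance
    return {p: [(r - mean) / std for r in rs]
            for p, rs in rewards_by_perm.items()}

def consistency_reward(decisions):
    """Toy consistency-aware reward: the fraction of permutations
    whose decision (already mapped back to the canonical option
    order) agrees with the majority decision across permutations.
    """
    majority_count = Counter(decisions).most_common(1)[0][1]
    return majority_count / len(decisions)
```

For example, if two permutations each yield rollout rewards `[1.0, 0.0]`, the pooled baseline is 0.5 and every correct rollout receives a positive advantage regardless of which permutation produced it; a set of decisions like `["B", "B", "A"]` earns a consistency reward of 2/3.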
Problem

Research questions and friction points this paper is trying to address.

selection bias
large language models
multiple-choice tasks
pairwise evaluation
permutation consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

selection bias
permutation consistency
relative policy optimization
LLM debiasing
group-based training
Jinquan Zheng
School of Economics and Management, East China Normal University
Jia Yuan
University of Macau
Jiacheng Yao
Southeast University
Wireless communication, Distributed learning in wireless networks
Chenyang Gu
Undergraduate, Peking University
Embodied AI, Robotic Manipulation
Pujun Zheng
School of Economics and Management, East China Normal University
Guoxiu He
School of Economics and Management, East China Normal University