Mitigating Selection Bias in Large Language Models via Permutation-Aware GRPO

📅 2026-03-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the susceptibility of large language models to non-semantic factors—such as option ordering and label symbols—in multiple-choice and pairwise evaluation tasks, which often leads to selection bias. To mitigate this issue, the authors propose Permutation-Aware Group Relative Policy Optimization (PA-GRPO), a novel reinforcement learning approach that, for the first time, incorporates permutation invariance into policy training. By constructing input permutation groups and introducing cross-permutation advantage estimation alongside a consistency-aware reward mechanism, PA-GRPO encourages the model to produce semantically consistent outputs under permutation perturbations—without requiring any intervention during inference. Experimental results across seven benchmarks demonstrate that PA-GRPO significantly outperforms strong baselines, substantially reducing selection bias while maintaining or even improving overall performance.

📝 Abstract
Large language models (LLMs) used for multiple-choice and pairwise evaluation tasks often exhibit selection bias due to non-semantic factors like option positions and label symbols. Existing inference-time debiasing is costly and may harm reasoning, while pointwise training ignores that the same question should yield consistent answers across permutations. To address this issue, we propose Permutation-Aware Group Relative Policy Optimization (PA-GRPO), which mitigates selection bias by enforcing permutation-consistent semantic reasoning. PA-GRPO constructs a permutation group for each instance by generating multiple candidate permutations, and optimizes the model using two complementary mechanisms: (1) cross-permutation advantage, which computes advantages relative to the mean reward over all permutations of the same instance, and (2) consistency-aware reward, which encourages the model to produce consistent decisions across different permutations. Experimental results demonstrate that PA-GRPO outperforms strong baselines across seven benchmarks, substantially reducing selection bias while maintaining high overall performance. The code will be made available on GitHub (https://github.com/ECNU-Text-Computing/PA-GRPO).
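The two mechanisms in the abstract can be sketched roughly as follows. This is an illustrative sketch only: the function names, data shapes, and the exact reward form are assumptions for exposition, not the paper's implementation.

```python
import statistics
from collections import Counter

def cross_permutation_advantages(rewards_by_perm):
    """Cross-permutation advantage (illustrative sketch).

    Instead of normalizing rewards within a single prompt's rollout
    group, pool the rewards over ALL permutations of the same
    instance, so every permutation is scored against a shared baseline.

    rewards_by_perm: {permutation_id: [reward per sampled rollout]}
    Returns the same structure with standardized advantages.
    """
    pooled = [r for rs in rewards_by_perm.values() for r in rs]
    mean = statistics.mean(pooled)
    std = statistics.pstdev(pooled) or 1.0  # guard against zero variance
    return {p: [(r - mean) / std for r in rs]
            for p, rs in rewards_by_perm.items()}

def consistency_reward(decisions):
    """Toy consistency-aware reward: the fraction of permutations
    whose decision (already mapped back to the canonical option
    order) agrees with the majority decision across permutations.
    """
    majority_count = Counter(decisions).most_common(1)[0][1]
    return majority_count / len(decisions)
```

For example, if two permutations each yield rollout rewards `[1.0, 0.0]`, the pooled baseline is 0.5 and every correct rollout receives a positive advantage regardless of which permutation produced it; a set of decisions like `["B", "B", "A"]` earns a consistency reward of 2/3.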
Problem

Research questions and friction points this paper is trying to address.

selection bias
large language models
multiple-choice tasks
pairwise evaluation
permutation consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

selection bias
permutation consistency
relative policy optimization
LLM debiasing
group-based training
Jinquan Zheng
School of Economics and Management, East China Normal University
Jia Yuan
University of Macau
Jiacheng Yao
Southeast University
Wireless communication, Distributed learning in wireless networks
Chenyang Gu
Undergraduate, Peking University
Embodied AI, Robotic Manipulation
Pujun Zheng
School of Economics and Management, East China Normal University
Guoxiu He
School of Economics and Management, East China Normal University