🤖 AI Summary
This work addresses the vulnerability of large language models (LLMs) acting as recommendation agents in high-stakes scenarios, where contextual biases can severely compromise recommendation reliability. To evaluate this issue systematically, we introduce BiasRecBench, the first benchmark specifically designed to assess the bias robustness of LLM-based recommenders. By combining calibrated quality gaps between candidate options with logically consistent bias synthesis, BiasRecBench enables rigorous evaluation across real-world tasks such as paper reviewing, e-commerce, and hiring. Extensive experiments reveal that both leading models (including Gemini-2.5/3-Pro, GPT-4o, and DeepSeek-R1) and smaller-scale LLMs are significantly susceptible to contextual biases. These findings highlight a critical reliability bottleneck: even LLMs with strong reasoning capabilities remain prone to biased recommendations in practical deployment settings.
📝 Abstract
Current Large Language Models (LLMs) are increasingly deployed in practically valuable agentic workflows such as Deep Research, e-commerce recommendation, and job recruitment. In these applications, LLMs must select optimal options from massive candidate pools, a setting we term the \textit{LLM-as-a-Recommender} paradigm. However, the reliability of LLM agents as recommenders remains underexplored. In this work, we introduce the \textbf{Bias} \textbf{Rec}ommendation \textbf{Bench}mark (\textbf{BiasRecBench}) to expose the critical vulnerability of such agents to biases in high-value real-world tasks. The benchmark covers three practical domains: paper review, e-commerce, and job recruitment. We construct a \textsc{Bias Synthesis Pipeline with Calibrated Quality Margins} that 1) synthesizes evaluation data by controlling the quality gap between optimal and sub-optimal options, providing a calibrated testbed for eliciting vulnerability to biases; and 2) injects contextual biases that are logically consistent with, and plausible for, the option contexts. Extensive experiments on both SOTA models (Gemini-2.5/3-Pro, GPT-4o, DeepSeek-R1) and small-scale LLMs reveal that agents frequently succumb to injected biases despite having sufficient reasoning capability to identify the ground truth. These findings expose a significant reliability bottleneck in current agentic workflows and call for specialized alignment strategies for LLM-as-a-Recommender. The complete code and evaluation datasets will be released publicly soon.
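To make the two pipeline steps concrete, here is a minimal, hypothetical sketch (not the authors' released code) of how one might pair a calibrated quality margin with a contextual bias injection. The candidate descriptions, the `synthesize_pair`/`inject_bias` helpers, and the example bias cue are all illustrative assumptions.

```python
import random

def synthesize_pair(margin: float, seed: int = 0) -> tuple[dict, dict]:
    """Toy step 1: create an (optimal, sub-optimal) candidate pair whose
    latent quality scores differ by exactly `margin` (the calibrated gap)."""
    rng = random.Random(seed)
    base = rng.uniform(0.5, 0.9 - margin)  # sub-optimal option's quality
    optimal = {
        "id": "A",
        "quality": base + margin,
        "desc": "Laptop with 16 GB RAM, 1 TB SSD, 2-year warranty.",
    }
    suboptimal = {
        "id": "B",
        "quality": base,
        "desc": "Laptop with 8 GB RAM, 256 GB SSD, 1-year warranty.",
    }
    return optimal, suboptimal

def inject_bias(option: dict, cue: str) -> dict:
    """Toy step 2: prepend a contextual bias cue (e.g. a fabricated
    endorsement) that is plausible for the option's domain."""
    biased = dict(option)
    biased["desc"] = f"{cue} {option['desc']}"
    return biased

# Build one evaluation instance: the weaker option carries the bias cue.
opt, sub = synthesize_pair(margin=0.2)
sub_biased = inject_bias(sub, "Rated #1 by a leading tech magazine.")
```

A robust recommender should still prefer option `A` here; an agent swayed by the injected endorsement on option `B` exhibits exactly the contextual-bias failure the benchmark measures.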