ArenaRL: Scaling RL for Open-Ended Agents via Tournament-based Relative Ranking

📅 2026-01-10
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses reinforcement learning (RL) for open-ended agent tasks, where pointwise reward models compress the scores within a group into a narrow range, so the training signal is dominated by reward-model noise. The authors propose ArenaRL, a framework that replaces pointwise scalar rewards with intra-group relative rankings, obtained from process-aware pairwise evaluation with multi-level rubrics and a seeded single-elimination tournament inside an intra-group adversarial arena. The tournament yields advantage estimates nearly as accurate as full pairwise comparison, which requires O(N²) comparisons, at only O(N) cost. To cover the full workflow for open-ended agents, from supervised fine-tuning (SFT) through RL training to multi-dimensional evaluation, the authors also build two benchmarks, Open-Travel and Open-DeepResearch. Experiments show that ArenaRL substantially outperforms standard RL baselines and yields more robust solutions on complex tasks.
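
As a rough check on the complexity figures quoted above, the number of judge calls can be counted directly. This is a back-of-the-envelope count, not taken from the paper: it assumes one judge call per match and ignores any extra calls a seeding step might need.

```latex
\begin{align*}
\text{full pairwise (round-robin):}\quad & \binom{N}{2} = \frac{N(N-1)}{2} = O(N^2) \\
\text{single elimination:}\quad & N - 1 = O(N) \\
\text{for } N = 8:\quad & 28 \text{ matches versus } 7
\end{align*}
```

Each single-elimination match removes exactly one trajectory, so N - 1 matches are enough to leave a single winner.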

📝 Abstract
Reinforcement learning has substantially improved the performance of LLM agents on tasks with verifiable outcomes, but it still struggles on open-ended agent tasks with vast solution spaces (e.g., complex travel planning). Due to the absence of objective ground-truth for these tasks, current RL algorithms largely rely on reward models that assign scalar scores to individual responses. We contend that such pointwise scoring suffers from an inherent discrimination collapse: the reward model struggles to distinguish subtle advantages among different trajectories, resulting in scores within a group being compressed into a narrow range. Consequently, the effective reward signal becomes dominated by noise from the reward model, leading to optimization stagnation. To address this, we propose ArenaRL, a reinforcement learning paradigm that shifts from pointwise scalar scoring to intra-group relative ranking. ArenaRL introduces a process-aware pairwise evaluation mechanism, employing multi-level rubrics to assign fine-grained relative scores to trajectories. Additionally, we construct an intra-group adversarial arena and devise a tournament-based ranking scheme to obtain stable advantage signals. Empirical results confirm that the built seeded single-elimination scheme achieves nearly equivalent advantage estimation accuracy to full pairwise comparisons with O(N^2) complexity, while operating with only O(N) complexity, striking an optimal balance between efficiency and precision. Furthermore, to address the lack of full-cycle benchmarks for open-ended agents, we build Open-Travel and Open-DeepResearch, two high-quality benchmarks featuring a comprehensive pipeline covering SFT, RL training, and multi-dimensional evaluation. Extensive experiments show that ArenaRL substantially outperforms standard RL baselines, enabling LLM agents to generate more robust solutions for complex real-world tasks.
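
The tournament idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: the function names `tournament_levels` and `group_advantages`, the use of a per-trajectory `seeds` score to order the bracket, the best-versus-worst pairing rule, the bye for odd group sizes, and the choice to standardize win counts into advantages. The `beats` callback stands in for a pairwise judge.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence, Tuple


def tournament_levels(
    seeds: Sequence[float],
    beats: Callable[[int, int], bool],
) -> Tuple[List[int], int]:
    """Run a seeded single-elimination bracket over indices 0..N-1.

    Returns (levels, n_matches), where levels[i] is the number of matches
    trajectory i won before it was eliminated (or the bracket ended).
    """
    n = len(seeds)
    levels = [0] * n
    # Hypothetical seeding rule: best seed first, so strong meets weak early.
    alive = sorted(range(n), key=lambda i: seeds[i], reverse=True)
    n_matches = 0
    while len(alive) > 1:
        next_round = []
        lo, hi = 0, len(alive) - 1
        while lo < hi:
            a, b = alive[lo], alive[hi]
            winner = a if beats(a, b) else b  # one judge call per match
            levels[winner] += 1
            next_round.append(winner)
            n_matches += 1
            lo += 1
            hi -= 1
        if lo == hi:  # odd field: the middle entry advances without a match
            next_round.append(alive[lo])
        alive = next_round
    return levels, n_matches


def group_advantages(levels: Sequence[int]) -> List[float]:
    """Standardize win counts within the group (assumed advantage mapping)."""
    mu, sigma = mean(levels), pstdev(levels)
    if sigma == 0:
        return [0.0] * len(levels)
    return [(x - mu) / sigma for x in levels]


# Toy group of 8 trajectories with a hidden quality score; the stand-in
# judge simply compares those scores, and the seeds are taken to be perfect.
quality = [0.1, 0.9, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4]
levels, n_matches = tournament_levels(
    seeds=quality, beats=lambda a, b: quality[a] > quality[b]
)
advantages = group_advantages(levels)
print(n_matches, levels)  # 7 [0, 3, 1, 0, 1, 0, 2, 0]
```

In this toy run the eight trajectories are ranked with seven judge calls. The win counts are coarse (several trajectories tie at zero), which is the price of the linear cost; how ArenaRL resolves such ties and derives its seeds is not specified in the abstract.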
Problem

Research questions and friction points this paper is trying to address.

open-ended tasks
reinforcement learning
reward model
discrimination collapse
relative ranking
Innovation

Methods, ideas, or system contributions that make the work stand out.

tournament-based ranking
relative preference learning
process-aware evaluation
open-ended agent benchmarking
discrimination collapse
Authors
Qiang Zhang
Tongyi Lab, Alibaba Group
Boli Chen
University College London
Systems and Control, Optimization, Smart Cities
Fanrui Zhang
Tongyi Lab, Alibaba Group
Ruixue Ding
Tongyi Lab, Alibaba Group
Shihang Wang
DAMO Academy, Alibaba Inc.
Natural Language Processing
Qiuchen Wang
University of Science and Technology of China
Computer Vision, Large Language Model
Yin Huang
Research Assistant, University of Florida
Multi-Armed Bandits, Edge Computing, Wireless Communications, Quantum Networking
Haonan Zhang
Amap, Alibaba Group
Rongxiang Zhu
Amap, Alibaba Group
Pengyong Wang
Amap, Alibaba Group
Ailin Ren
Amap, Alibaba Group
Xin Li
Alibaba Group
Natural Language Processing
Peng Xie
Tongyi Lab, Alibaba Group
Jiawei Liu
Tongyi Lab, Alibaba Group
Ning Guo
Amap, Alibaba Group
Jing-Ying Zhou
Tongyi Lab, Alibaba Group
Zheng-Jun Zha
Tongyi Lab, Alibaba Group