🤖 AI Summary
Existing benchmarks struggle to evaluate the ability of large language model (LLM) agents to autonomously design and execute a complete reinforcement learning (RL) post-training pipeline; in particular, they lack assessments of interactive RL engineering. To address this gap, this work proposes Agent² RL-Bench, a multi-tiered benchmark comprising six tasks that span a complexity gradient from static rule-based training to closed-loop online RL. The benchmark enables the first automated diagnosis of agent-driven RL post-training through isolated workspaces, structured task hierarchies, runtime code tracing, and automated analysis reporting. It integrates diverse agent architectures and six LLMs for systematic evaluation. Experiments show that on ALFWorld, agents pretrained with supervised fine-tuning (SFT) and further trained online with GRPO improve performance from 5.97 to 93.28; however, improvements are marginal on tasks such as DeepSearchQA, and the choice of driving LLM significantly affects performance in interactive settings.
📝 Abstract
We introduce Agent^2 RL-Bench, a benchmark for evaluating agentic RL post-training -- whether LLM agents can autonomously design, implement, and run complete RL pipelines that improve foundation models. This capability is important because RL post-training increasingly drives model alignment and specialization, yet existing benchmarks remain largely static: supervised fine-tuning alone yields strong results, leaving interactive RL engineering untested. Agent^2 RL-Bench addresses this with six tasks across three levels -- from static rule-based training to closed-loop online RL with trajectory collection -- each adding a structural requirement that prior levels do not impose. The benchmark provides isolated workspaces with a grading API, runtime instrumentation that records every submission and code revision, and automated post-hoc analysis that generates structured run reports, enabling the first automated diagnostic of agent-driven post-training behavior. Across multiple agent stacks spanning five agent systems and six driver LLMs, we find that agents achieve striking interactive gains -- on ALFWorld, an RL-only agent improves from 5.97 to 93.28 via SFT warm-up and GRPO with online rollouts -- yet make only marginal progress on others (DeepSearchQA: +2.75 within evaluation noise), and that driver choice has a large effect on interactive tasks -- within the same scaffold, switching drivers changes interactive improvement from near-zero to +78pp. More broadly, the benchmark reveals that supervised pipelines dominate agent-driven post-training under fixed budgets, with online RL succeeding as the final best route only on ALFWorld. Code is available at https://github.com/microsoft/RD-Agent/tree/main/rdagent/scenarios/rl/autorl_bench.