AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation

📅 2026-04-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing rule-based verifiers and LLM-as-a-Judge models struggle to reliably verify agent behavior in complex environments and generalize poorly beyond narrow domains. This work proposes AJ-Bench, the first Agent-as-a-Judge evaluation benchmark designed for environment-aware assessment, in which judging agents actively interact with the environment and tools to gather verifiable evidence. The benchmark evaluates judge agents' capabilities in information acquisition, state verification, and process verification across 155 tasks and 516 annotated trajectories spanning search, data systems, and graphical user interfaces. Experiments show consistent gains over LLM-as-a-Judge baselines and expose substantial open challenges in agent-based verification.

📝 Abstract

As reinforcement learning continues to scale the training of large language model-based agents, reliably verifying agent behaviors in complex environments has become increasingly challenging. Existing approaches rely on rule-based verifiers or LLM-as-a-Judge models, which struggle to generalize beyond narrow domains. Agent-as-a-Judge addresses this limitation by actively interacting with environments and tools to acquire verifiable evidence, yet its capabilities remain underexplored. We introduce AJ-Bench, a benchmark that systematically evaluates Agent-as-a-Judge across three domains (search, data systems, and graphical user interfaces), comprising 155 tasks and 516 annotated trajectories. The benchmark comprehensively assesses judge agents' abilities in information acquisition, state verification, and process verification. Experiments demonstrate consistent performance gains over LLM-as-a-Judge baselines, while also revealing substantial open challenges in agent-based verification. Our data and code are available at https://aj-bench.github.io/.
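
To make the contrast in the abstract concrete, the following is a minimal sketch of state verification, not the benchmark's actual code. It assumes a toy key-value environment and deterministic stand-ins for both judges. Every name in it (`Trajectory`, `Verdict`, `transcript_only_judge`, `agent_judge`, the `read_state` tool, the order example) is hypothetical and chosen only for illustration. The point it shows is that a judge that reads only the trajectory can be misled by the agent's own claim, while a judge that queries the environment collects evidence it can check.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Trajectory:
    """A recorded agent run: the task, the logged steps, and the agent's own claim."""
    task: str
    steps: List[str]
    claimed_success: bool


@dataclass
class Verdict:
    success: bool
    evidence: List[str] = field(default_factory=list)


def transcript_only_judge(traj: Trajectory) -> Verdict:
    """Stand-in for a judge that only reads the trajectory (assumption: it
    accepts the agent's claim because it has no way to inspect the environment)."""
    return Verdict(success=traj.claimed_success, evidence=["agent's own claim"])


def agent_judge(
    traj: Trajectory,
    tools: Dict[str, Callable[[str], str]],
    checks: Dict[str, str],
) -> Verdict:
    """Toy environment-aware judge: calls a tool once per check and records
    what it observed, so the verdict rests on environment state."""
    evidence: List[str] = []
    ok = True
    for key, expected in checks.items():
        observed = tools["read_state"](key)
        evidence.append(f"{key}: observed={observed!r}, expected={expected!r}")
        ok = ok and (observed == expected)
    return Verdict(success=ok, evidence=evidence)


# Hypothetical scenario: the agent reports success, but the write never landed.
env_state = {"orders/17/status": "pending"}
tools = {"read_state": lambda key: env_state.get(key, "<missing>")}

traj = Trajectory(
    task="Mark order 17 as shipped",
    steps=["open order 17", "click 'ship'"],
    claimed_success=True,
)
checks = {"orders/17/status": "shipped"}

baseline = transcript_only_judge(traj)
judged = agent_judge(traj, tools, checks)
print("transcript-only verdict:", baseline.success)
print("environment-aware verdict:", judged.success)
print("evidence:", judged.evidence)
```

In a real setting both judges would be model-driven and the checks would not be handed over in advance; the sketch fixes them only so that the difference between trusting a transcript and inspecting state is visible in a few lines.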

Problem

Research questions and friction points this paper is trying to address.

Agent-as-a-Judge

environment-aware evaluation

reinforcement learning

behavior verification

generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Agent-as-a-Judge

environment-aware evaluation

reinforcement learning