Active Evaluation of General Agents: Problem Definition and Comparison of Baseline Algorithms

📅 2026-01-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Evaluating general-purpose agents is hindered by high costs, inter-task correlations, and stochasticity, which together necessitate large sample sizes for accurate rankings. This work formalizes the problem of active evaluation for general agents and introduces an online framework that, on each iteration, selects which tasks and agents to score while updating rankings continuously. The framework integrates the Elo rating system, Soft Condorcet Optimization, and a proportional-representation task-selection strategy. Experiments on synthetic data and Atari game-playing agents show that Elo is a consistently reliable choice, while Soft Condorcet Optimization matches Elo on synthetic data and significantly outperforms it on real Atari agent evaluation. The proportional-representation strategy also accelerates convergence of the ranking error when task variability is high.

📝 Abstract
As intelligent agents become more generally capable, i.e. able to master a wide variety of tasks, the complexity and cost of properly evaluating them rises significantly. Tasks that assess specific capabilities of the agents can be correlated and stochastic, requiring many samples for accurate comparisons and leading to added costs. In this paper, we propose a formal definition and a conceptual framework for active evaluation of agents across multiple tasks, which assesses the performance of ranking algorithms as a function of the number of evaluation data samples. Rather than curating, filtering, or compressing existing data sets as a preprocessing step, we propose an online framing: on every iteration, the ranking algorithm chooses the task and agents to sample scores from. Then, evaluation algorithms report a ranking of agents on each iteration, and their performance is assessed with respect to the ground truth ranking over time. Several baselines are compared under different experimental contexts, with synthetically generated data and simulated online access to real evaluation data from Atari game-playing agents. We find that the classical Elo rating system -- while it suffers from well-known theoretical failure modes -- is a consistently reliable choice for efficient reduction of ranking error in practice. A recently-proposed method, Soft Condorcet Optimization, shows comparable performance to Elo on synthetic data and significantly outperforms Elo on real Atari agent evaluation. When task variation from the ground truth is high, selecting tasks based on proportional representation leads to a higher rate of ranking-error reduction.
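The online framing described in the abstract can be sketched as a simple loop: on each iteration, pick a task and a pair of agents, sample an outcome, and update the current ranking. Below is a minimal illustration using the standard Elo update as the ranking algorithm. This is a sketch under assumptions, not the paper's implementation: task and agent selection are uniform (not proportional representation), outcomes are simulated from hypothetical ground-truth skills, and all names (`elo_update`, `active_eval_loop`, `true_skill`) are illustrative.

```python
import random

def elo_update(ratings, a, b, score_a, k=32.0):
    """Standard Elo update; score_a is 1.0 if agent a wins, 0.0 if it loses."""
    expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
    delta = k * (score_a - expected_a)
    ratings[a] += delta
    ratings[b] -= delta

def active_eval_loop(true_skill, tasks, n_iters=2000, seed=0):
    """Online active evaluation sketch: sample (task, agent pair), score, re-rank.

    `true_skill` maps agent name -> hypothetical ground-truth Elo-scale skill,
    used only to simulate stochastic match outcomes.
    """
    rng = random.Random(seed)
    agents = list(true_skill)
    ratings = {a: 0.0 for a in agents}
    for _ in range(n_iters):
        a, b = rng.sample(agents, 2)   # uniform agent-pair selection
        _task = rng.choice(tasks)      # uniform task selection (per-task effects omitted)
        # Simulated outcome: the agent with higher true skill wins more often.
        p_a = 1.0 / (1.0 + 10 ** ((true_skill[b] - true_skill[a]) / 400.0))
        score_a = 1.0 if rng.random() < p_a else 0.0
        elo_update(ratings, a, b, score_a)
    # The reported ranking after the final iteration, best agent first.
    return sorted(agents, key=ratings.get, reverse=True)
```

With well-separated skills (e.g. 400-point gaps), the recovered ranking matches the ground-truth ordering after a few thousand samples; the paper's framework instead tracks how quickly ranking error falls as a function of the sample count.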
Problem

Research questions and friction points this paper is trying to address.

agent evaluation
active evaluation
ranking algorithms
general agents
evaluation efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

active evaluation
online task selection
agent ranking
Elo rating
Soft Condorcet Optimization