Soft Condorcet Optimization for Ranking of General Agents

📅 2024-10-31
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the challenge of ranking general-purpose AI agents across diverse tasks. The authors propose Soft Condorcet Optimization (SCO), which treats agent comparisons in the evaluation data as noisy votes and seeks the total order that makes the fewest mistakes in predicting them; this optimal ranking is the maximum likelihood estimate when the votes are interpreted as noisy samples from a ground-truth ranking, in the spirit of Condorcet's original voting criteria. When a Condorcet winner exists, SCO ratings are maximal for it, a guarantee the classical Elo rating system is shown not to have. Three optimization algorithms are proposed to compute SCO ratings, and SCO is evaluated as a scalable approximation to the Kemeny-Young voting method, with ranking quality measured by normalized Kendall-tau distance. Empirically, SCO rankings are on average 0 to 0.043 away from the optimal ranking across 865 preference profiles from the PrefLib archive; in a simulated noisy tournament, SCO is the best among several baselines when 59% or more of the preference data is missing; and it provides the best approximation to the optimal ranking, measured on held-out test sets, on a real-world Diplomacy dataset of 52,958 players across 31,049 games.

📝 Abstract
Driving progress of AI models and agents requires comparing their performance on standardized benchmarks; for general agents, individual performances must be aggregated across a potentially wide variety of different tasks. In this paper, we describe a novel ranking scheme inspired by social choice frameworks, called Soft Condorcet Optimization (SCO), to compute the optimal ranking of agents: the one that makes the fewest mistakes in predicting the agent comparisons in the evaluation data. This optimal ranking is the maximum likelihood estimate when evaluation data (which we view as votes) are interpreted as noisy samples from a ground truth ranking, a solution to Condorcet's original voting system criteria. SCO ratings are maximal for Condorcet winners when they exist, which we show is not necessarily true for the classical rating system Elo. We propose three optimization algorithms to compute SCO ratings and evaluate their empirical performance. When serving as an approximation to the Kemeny-Young voting method, SCO rankings are on average 0 to 0.043 away from the optimal ranking in normalized Kendall-tau distance across 865 preference profiles from the PrefLib open ranking archive. In a simulated noisy tournament setting, SCO achieves accurate approximations to the ground truth ranking and the best among several baselines when 59% or more of the preference data is missing. Finally, SCO ranking provides the best approximation to the optimal ranking, measured on held-out test sets, in a problem containing 52,958 human players across 31,049 games of the classic seven-player game of Diplomacy.
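The normalized Kendall-tau distance used above is the fraction of item pairs that two rankings order differently: 0 means identical order, 1 means fully reversed. A minimal sketch of the metric (function and variable names are illustrative, not from the paper's code):

```python
from itertools import combinations

def normalized_kendall_tau(rank_a, rank_b):
    """Fraction of item pairs ordered differently by the two rankings.

    rank_a, rank_b: lists containing the same items, best first.
    Returns a value in [0, 1]: 0 = identical order, 1 = reversed.
    """
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    # A pair is discordant when the two rankings disagree on its order.
    discordant = sum(
        1
        for x, y in combinations(rank_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )
    n = len(rank_a)
    return discordant / (n * (n - 1) / 2)
```

For example, swapping one adjacent pair in a 3-item ranking yields a distance of 1/3 (one discordant pair out of three).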
Problem

Research questions and friction points this paper is trying to address.

Compute an optimal ranking of general AI agents.
Aggregate performance across a wide variety of tasks.
Minimize mistakes in predicting agent comparisons in the evaluation data.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Soft Condorcet Optimization
Maximum likelihood ranking
Kemeny-Young approximation
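One way to read the "soft" in Soft Condorcet Optimization: keep a real-valued rating per agent and replace the hard 0/1 count of mispredicted votes with a differentiable sigmoid surrogate, then minimize it by gradient descent. The sketch below follows that reading; the loss form, temperature, learning rate, and loop structure are assumptions for illustration, not the paper's exact algorithm:

```python
import math
import random

def sco_ratings(agents, votes, steps=2000, lr=0.5, temperature=1.0):
    """Stochastic gradient descent on a sigmoid relaxation of vote mispredictions.

    votes: list of (winner, loser) pairwise comparisons from the evaluation data.
    Returns a dict agent -> rating; sorting agents by rating gives the ranking.
    """
    theta = {a: 0.0 for a in agents}
    for _ in range(steps):
        w, l = random.choice(votes)
        # Soft disagreement: sigmoid of the (scaled) margin by which the
        # current ratings mispredict this vote; ~1 when clearly wrong, ~0 when right.
        margin = (theta[l] - theta[w]) / temperature
        s = 1.0 / (1.0 + math.exp(-margin))
        grad = s * (1.0 - s) / temperature  # derivative of the sigmoid loss w.r.t. margin
        theta[w] += lr * grad               # push the winner up, the loser down
        theta[l] -= lr * grad
    return theta
```

With a consistent set of votes (e.g. a beats b, b beats c, a beats c), the ratings converge to the order a > b > c; with noisy or cyclic votes, the minimizer trades off the disagreements, which is what makes the relaxation usable where a Kemeny-Young search would be expensive.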