Using tournaments to calculate AUROC for zero-shot classification with LLMs

📅 2025-02-20
🤖 AI Summary
Large language models (LLMs) have no tunable decision boundary in zero-shot classification, which hinders fair comparison with supervised models. To address this, we propose PairElo, a framework that reformulates zero-shot binary classification as a pairwise comparison task. PairElo is the first to adapt competitive tournament mechanisms and the Elo rating system to this setting: it elicits relative preferences via pairwise prompting, then refines an instance-level confidence ranking through dynamic Elo updates and a comparison scheduling algorithm. The resulting ranking makes it possible to compute supervised evaluation metrics such as AUROC in purely zero-shot settings, and it yields richer ranking and uncertainty information than standard logits. Experiments across multiple benchmarks demonstrate that PairElo significantly improves AUROC, achieves higher ranking quality with fewer comparisons, and supports fair, unified evaluation between LLMs and supervised models.

📝 Abstract
Large language models perform surprisingly well on many zero-shot classification tasks, but are difficult to fairly compare to supervised classifiers due to the lack of a modifiable decision boundary. In this work, we propose and evaluate a method that converts binary classification tasks into pairwise comparison tasks, obtaining relative rankings from LLMs. Repeated pairwise comparisons can be used to score instances using the Elo rating system (used in chess and other competitions), inducing a confidence ordering over instances in a dataset. We evaluate scheduling algorithms for their ability to minimize comparisons, and show that our proposed algorithm leads to improved classification performance, while also providing more information than traditional zero-shot classification.
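
The Elo scoring step described in the abstract can be illustrated with a minimal sketch. It uses the standard Elo formulas from chess; the initial rating of 1000, the K-factor of 32, and the random pairing in `rate_instances` are assumptions made for illustration, not the paper's settings or its scheduling algorithm. The `compare` callback is a hypothetical stand-in for the pairwise LLM prompt: it should return True when the first instance is judged more likely to be positive.

```python
import random


def expected_score(r_a, r_b):
    """Standard Elo expected score of A against B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a, r_b, a_wins, k=32.0):
    """Return updated (r_a, r_b) after one comparison.

    k=32 is an assumed K-factor, not a value taken from the paper.
    """
    delta = k * ((1.0 if a_wins else 0.0) - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


def rate_instances(n_items, compare, n_rounds=20, k=32.0, seed=0):
    """Rate instances from repeated pairwise comparisons.

    Pairs are drawn at random each round; this is a placeholder for
    whatever scheduling algorithm is actually used.
    """
    rng = random.Random(seed)
    ratings = [1000.0] * n_items  # assumed initial rating
    for _ in range(n_rounds):
        order = list(range(n_items))
        rng.shuffle(order)
        # Pair neighbours in the shuffled order; an odd item sits out.
        for i, j in zip(order[::2], order[1::2]):
            ratings[i], ratings[j] = elo_update(
                ratings[i], ratings[j], compare(i, j), k
            )
    return ratings
```

Sorting instances by the returned ratings gives the confidence ordering; any threshold on the rating then acts as an adjustable decision boundary.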
Problem

Research questions and friction points this paper is trying to address.

AUROC calculation for zero-shot classification
Pairwise comparison method for LLMs
Elo rating system for confidence ordering
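
As a rough illustration of how a confidence ordering makes AUROC computable, the sketch below scores a list of per-instance ratings against gold binary labels. It uses the rank-based definition of AUROC (the fraction of positive/negative pairs in which the positive instance has the higher score, with ties counted as one half). The function name and inputs are hypothetical and not taken from the paper.

```python
def auroc(scores, labels):
    """AUROC as the probability that a random positive outranks a random negative.

    scores: per-instance confidence values (for example, Elo ratings).
    labels: gold binary labels, 1 for positive and 0 for negative.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))
```

Only the ordering of the scores matters here, which is why ratings obtained from relative comparisons are sufficient even though they are not calibrated probabilities.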
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot classification with LLMs
Conversion of binary classification into pairwise comparison tasks
Elo rating system for scoring instances
WonJin Yoon
Boston Children's Hospital, Harvard University
BioNLP, Clinical NLP, Natural Language Processing
Ian Bulovic
Boston Children’s Hospital
Timothy A. Miller
Boston Children’s Hospital, Harvard Medical School