What should an AI assessor optimise for?

📅 2025-02-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates target metric selection and optimization strategies in AI evaluation: specifically, whether to directly optimize the final performance metric (e.g., absolute error, spherical score) or instead train on a different, possibly more informative, proxy metric and map predictions back. Through systematic experiments on regression and classification tasks across 20 tabular datasets, the authors compare approaches including monotonic/non-monotonic mappings, loss reweighting, and post-hoc calibration. Counterintuitively, optimizing "more informative" proxy metrics does not consistently improve target metric performance. In contrast, specific monotonic transformations—such as using the logistic loss when targeting absolute or quadratic regression errors, or the logarithmic score when targeting quadratic or spherical classification scores—yield substantial gains: absolute error decreases by up to 12.7%, and spherical score improves by 9.3%. These findings underscore the critical role of loss function design in assessor training and point toward robust, transferable AI evaluation paradigms.

📝 Abstract
An AI assessor is an external, ideally independent system that predicts an indicator, e.g., a loss value, of another AI system. Assessors can leverage information from the test results of many other AI systems and have the flexibility of being trained on any loss function or scoring rule: from squared error to toxicity metrics. Here we address the question: is it always optimal to train the assessor for the target metric? Or could it be better to train for a different metric and then map predictions back to the target metric? Using twenty regression and classification problems with tabular data, we experimentally explore this question for, respectively, regression losses and classification scores with monotonic and non-monotonic mappings and find that, contrary to intuition, optimising for more informative metrics is not generally better. Surprisingly, some monotonic transformations are promising. For example, the logistic loss is useful for minimising absolute or quadratic errors in regression, and the logarithmic score helps maximise quadratic or spherical scores in classification.
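The setup described in the abstract can be sketched in a few lines: a base model is trained, its per-instance loss becomes the assessor's target, and an assessor trained on a monotone transform of that loss is mapped back to the target metric for comparison. This is an illustrative sketch, not the paper's code; the dataset, models, and the `log1p`/`expm1` transform pair are assumptions chosen for simplicity.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Base AI system whose per-instance loss the assessor will predict.
X, y = make_regression(n_samples=2000, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
base = LinearRegression().fit(X_tr, y_tr)

# Target metric: the base system's absolute error on each instance.
abs_err_tr = np.abs(y_tr - base.predict(X_tr))
abs_err_te = np.abs(y_te - base.predict(X_te))

# Assessor A: trained directly on the target metric.
direct = GradientBoostingRegressor(random_state=0).fit(X_tr, abs_err_tr)
mae_direct = np.mean(np.abs(direct.predict(X_te) - abs_err_te))

# Assessor B: trained on a monotone transform of the loss (log1p),
# with predictions mapped back to the target scale via expm1.
mapped = GradientBoostingRegressor(random_state=0).fit(X_tr, np.log1p(abs_err_tr))
pred_back = np.expm1(mapped.predict(X_te))
mae_mapped = np.mean(np.abs(pred_back - abs_err_te))
```

Comparing `mae_direct` and `mae_mapped` on held-out data mirrors the paper's question of whether training on a transformed metric and mapping back can beat direct optimization of the target metric.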
Problem

Research questions and friction points this paper is trying to address.

Artificial Intelligence Evaluation
Alternative Assessment Standards
Performance Metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diverse AI Evaluation
Logistic Loss
Logarithmic Scoring