Multi-domain performance analysis with scores tailored to user preferences

📅 2025-12-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two key limitations of multi-domain algorithm performance evaluation: (i) the neglect of user preferences in assessment, and (ii) the masking of domain-specific performance disparities by conventional arithmetic averaging. To this end, the authors propose a weighted scoring framework parameterized by user preferences. Methodologically, they model a performance as a probability measure (e.g., a normalized confusion matrix); show that the weighted mean of such measures is the summarization; prove that only certain remarkable scores, including a continuous family of ranking scores parameterized by user preferences, assign the summarized performance a weighted arithmetic mean of the domain-specific values; and rigorously define four critical domain types: easiest, most difficult, preponderant, and bottleneck. Contributions include: (i) a general theoretical foundation for multi-domain performance analysis, independent of the task; (ii) new visual tools for two-class classification; and (iii) fine-grained, interpretable performance decomposition, enhancing transparency and practical utility in cross-domain evaluation.

📝 Abstract
The performance of algorithms, methods, and models tends to depend heavily on the distribution of the cases to which they are applied, a distribution that is specific to the application domain. After performing an evaluation in several domains, it is highly informative to compute a (weighted) mean performance and, as shown in this paper, to scrutinize what happens during this averaging. To achieve this goal, we adopt a probabilistic framework and consider a performance as a probability measure (e.g., a normalized confusion matrix for a classification task). The corresponding weighted mean is known as the summarization, and only some remarkable scores assign to the summarized performance a value equal to a weighted arithmetic mean of the values assigned to the domain-specific performances. These scores include the family of ranking scores, a continuum parameterized by user preferences, and the weights to consider in the arithmetic mean depend on the user preferences. On this basis, we rigorously define four domains, named the easiest, most difficult, preponderant, and bottleneck domains, as functions of user preferences. After establishing the theory in a general setting, regardless of the task, we develop new visual tools for two-class classification.
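The abstract's central idea can be sketched numerically: treat each domain-specific performance as a normalized confusion matrix, form the weighted mean (the summarization), and check that a score linear in the measure, such as accuracy, assigns the summarized performance the weighted arithmetic mean of the domain-specific values. This is a minimal illustration with made-up numbers, not the paper's own code; the matrices, weights, and the choice of accuracy as the score are assumptions for the example.

```python
import numpy as np

# Hypothetical per-domain performances for a two-class task, each a
# normalized confusion matrix (entries sum to 1).
# Rows: true class, columns: predicted class.
perf_domain_a = np.array([[0.45, 0.05],
                          [0.10, 0.40]])
perf_domain_b = np.array([[0.30, 0.20],
                          [0.05, 0.45]])

# Domain weights (e.g., relative case volume); they sum to 1.
weights = np.array([0.7, 0.3])

# Summarization: the weighted mean of the domain-specific measures,
# which is itself a normalized confusion matrix.
summary = weights[0] * perf_domain_a + weights[1] * perf_domain_b

def accuracy(cm):
    # Accuracy is the trace of a normalized confusion matrix,
    # hence linear in the measure.
    return float(np.trace(cm))

# For such a score, the value of the summarized performance equals the
# weighted arithmetic mean of the domain-specific values.
lhs = accuracy(summary)
rhs = weights[0] * accuracy(perf_domain_a) + weights[1] * accuracy(perf_domain_b)
print(lhs, rhs)  # both 0.82
```

Scores that are not linear in the measure (e.g., F1) do not commute with the averaging in this way, which is why the paper singles out the scores that do.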
Problem

Research questions and friction points this paper is trying to address.

Analyzing multi-domain algorithm performance with user-preference-based scores
Investigating weighted mean performance across different application domains
Defining domain difficulty and importance based on user preferences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Probabilistic framework for performance as probability measure
Weighted mean summarization with user preference parameterization
Four domain definitions based on user preferences
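The preference-dependence of the domain definitions above can be illustrated with a toy example. The score below is a simple preference-weighted trade-off between sensitivity and specificity, not the paper's exact ranking-score family; the confusion matrices and the parameter theta are assumptions for the sketch. It shows that which domain counts as "most difficult" (lowest score) flips as the user preference changes.

```python
import numpy as np

# Hypothetical normalized confusion matrices for two domains
# (rows: true class, columns: predicted class).
domains = {
    "A": np.array([[0.48, 0.02], [0.20, 0.30]]),  # strong on class 0
    "B": np.array([[0.30, 0.20], [0.02, 0.48]]),  # strong on class 1
}

def preference_score(cm, theta):
    """Illustrative preference-weighted score (not the paper's exact
    family): theta trades off sensitivity against specificity."""
    tpr = cm[0, 0] / cm[0].sum()  # sensitivity (class 0 recall)
    tnr = cm[1, 1] / cm[1].sum()  # specificity (class 1 recall)
    return theta * tpr + (1 - theta) * tnr

# The most difficult domain (lowest score) depends on the preference.
for theta in (0.2, 0.8):
    hardest = min(domains, key=lambda d: preference_score(domains[d], theta))
    print(theta, hardest)  # -> 0.2 A, then 0.8 B
```

With theta = 0.2 (specificity-oriented user) domain A scores lowest, while with theta = 0.8 (sensitivity-oriented user) domain B does, so the "most difficult" label is a function of user preferences rather than a fixed property of the data.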