Towards Multi-Stakeholder Evaluation of ML Models: A Crowdsourcing Study on Metric Preferences in Job-matching System

📅 2025-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Recruitment matching systems lack standardized evaluation criteria and struggle to balance the diverse interests of stakeholders (e.g., job seekers, employers). Method: We propose a crowdsourced empirical approach for selecting multi-stakeholder evaluation metrics, grounded in 20 rounds of pairwise preference experiments involving 837 participants. We quantitatively measure utility preferences across seven core dimensions—including accuracy, fairness, and transparency—and integrate utility modeling with K-means clustering to identify five distinct preference clusters. Contribution/Results: We uncover statistically significant associations between these clusters and demographic attributes (e.g., age, education) and job-seeking status. Based on these findings, we design a clustering-driven, multi-perspective metric selection framework. This work advances fair, interpretable, and collaborative machine learning evaluation by providing both theoretical grounding and large-scale empirical evidence for stakeholder-aligned assessment methodologies.
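The pipeline sketched above — pairwise model choices, per-participant utility estimation over seven dimensions, then K-means into five clusters — could look roughly like the following. This is a minimal sketch under assumptions: the paper does not publish its exact estimation procedure or the full metric names beyond accuracy, fairness, and transparency, so the remaining metric labels are placeholders and a Bradley-Terry-style logistic fit stands in for the utility model; the choice data here is simulated.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Seven evaluation dimensions; names beyond the three cited in the
# summary are illustrative placeholders, not the paper's actual list.
METRICS = ["accuracy", "fairness", "transparency",
           "diversity", "robustness", "privacy", "speed"]
N_PARTICIPANTS, N_ROUNDS = 837, 20
rng = np.random.default_rng(0)

def estimate_utilities(choices, diffs):
    """Bradley-Terry-style fit: one utility weight per metric.

    diffs:   (n_rounds, 7) metric differences (model A minus model B)
    choices: (n_rounds,)   1 if the participant preferred model A, else 0
    """
    clf = LogisticRegression().fit(diffs, choices)
    return clf.coef_.ravel()

# Simulate participants with latent preference weights whose choices
# follow a logit model, then recover utilities from the 20 choices each.
true_w = rng.normal(size=(N_PARTICIPANTS, len(METRICS)))
utilities = np.empty_like(true_w)
for i in range(N_PARTICIPANTS):
    diffs = rng.normal(size=(N_ROUNDS, len(METRICS)))
    p = 1.0 / (1.0 + np.exp(-diffs @ true_w[i]))
    choices = (rng.random(N_ROUNDS) < p).astype(int)
    if choices.min() == choices.max():
        choices[0] ^= 1  # ensure both classes exist so the fit is defined
    utilities[i] = estimate_utilities(choices, diffs)

# Group participants into five preference clusters, as in the study.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(utilities)
```

The per-cluster mean of `utilities` would then characterize each group's metric priorities, which the paper goes on to associate with demographic attributes and job-seeking status.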

📝 Abstract
While machine learning (ML) technology affects diverse stakeholders, there is no one-size-fits-all metric to evaluate the quality of outputs, including performance and fairness. Using predetermined metrics without soliciting stakeholder opinions is problematic because it leads to an unfair disregard for stakeholders in the ML pipeline. In this study, to establish practical ways to incorporate diverse stakeholder opinions into the selection of metrics for ML, we investigate participants' preferences for different metrics by using crowdsourcing. We ask 837 participants to choose a better model from two hypothetical ML models in a hypothetical job-matching system twenty times and calculate their utility values for seven metrics. To examine the participants' feedback in detail, we divide them into five clusters based on their utility values and analyze the tendencies of each cluster, including their preferences for metrics and common attributes. Based on the results, we discuss the points that should be considered when selecting appropriate metrics and evaluating ML models with multiple stakeholders.
Problem

Research questions and friction points this paper is trying to address.

No one-size-fits-all metric exists for evaluating ML model outputs, including performance and fairness.
Stakeholder opinions are often ignored when evaluation metrics are selected.
Using predetermined metrics without soliciting stakeholder input unfairly disregards stakeholders in the ML pipeline.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Crowdsourced pairwise-comparison experiments (837 participants, 20 rounds each) to elicit stakeholder metric preferences.
Utility modeling over seven metrics combined with K-means clustering to identify five distinct preference clusters.
Cluster-level analysis linking metric preferences to demographic attributes and job-seeking status in a hypothetical job-matching system.