🤖 AI Summary
This study addresses the lack of robustness validation for core metrics, particularly fairness measures, in responsible AI evaluation. It combines methodological reflection and empirical analysis through a systematic literature review, cross-domain case studies (recommender systems and AI in Science), and methodological synthesis. Its key contribution is a first principled, general-purpose set of guidelines for developing reliable responsible AI metrics, moving fairness-metric robustness research from isolated empirical checks toward transferable design principles and a unified validation framework, and thereby bridging a methodological gap in responsible AI assessment. The resulting non-exhaustive yet broadly applicable guidelines comprise three interrelated practice categories: metric design, sensitivity analysis, and contextual adaptation. Together they provide both theoretical grounding and actionable pathways for trustworthy AI evaluation.
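To make the sensitivity-analysis practice category concrete, the sketch below shows one common way such a robustness check can be run: bootstrap resampling of a fairness metric to see how stable its value is under data perturbation. This is an illustrative example only, not taken from the paper; the metric (`exposure_gap`), the group labels, and the synthetic data are hypothetical.

```python
# Illustrative sketch (not from the paper): bootstrap sensitivity analysis
# of a toy fairness metric for a recommender-style setting.
import numpy as np

rng = np.random.default_rng(0)

def exposure_gap(scores: np.ndarray, groups: np.ndarray) -> float:
    """Absolute difference in mean recommendation score between two groups."""
    return abs(scores[groups == 0].mean() - scores[groups == 1].mean())

# Hypothetical data: per-item recommendation scores and binary group labels.
scores = rng.random(1000)
groups = rng.integers(0, 2, size=1000)

# Sensitivity analysis: how stable is the metric under resampling of the data?
n_boot = 2000
estimates = np.empty(n_boot)
for b in range(n_boot):
    idx = rng.integers(0, len(scores), size=len(scores))
    estimates[b] = exposure_gap(scores[idx], groups[idx])

point = exposure_gap(scores, groups)
lo, hi = np.percentile(estimates, [2.5, 97.5])
print(f"exposure gap = {point:.4f}, 95% bootstrap interval = [{lo:.4f}, {hi:.4f}]")
print(f"bootstrap std (a simple robustness indicator) = {estimates.std():.4f}")
```

A wide bootstrap interval relative to the point estimate is one signal that a metric's reported value is fragile and should be interpreted, or redesigned, with care.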
📝 Abstract
The development of Artificial Intelligence (AI), including AI in Science (AIS), should follow the principles of responsible AI. Progress in responsible AI is often quantified through evaluation metrics, yet comparatively little work has assessed the robustness and reliability of the metrics themselves. We reflect on prior work that examines the robustness of fairness metrics for recommender systems, one widely deployed type of AI application, and summarise its key takeaways into a non-exhaustive set of guidelines for developing reliable metrics of responsible AI. These guidelines apply to a broad spectrum of AI applications, including AIS.