🤖 AI Summary
This work presents Scorio.jl, a Julia toolkit for evaluating and ranking systems that produce repeated stochastic outputs on shared tasks. The toolkit offers a unified tensor interface that integrates diverse ranking paradigms—direct scoring, pairwise comparison, psychometric modeling, voting schemes, graph-based methods, and listwise approaches—so the same benchmark can be compared and analyzed across methods. Pilot experiments on synthetic data examine ranking recovery accuracy, stability under limited trials, and runtime scaling. By providing a cohesive and efficient framework, this contribution addresses the need for standardized, reproducible evaluation of multi-output systems with inherent randomness.
📝 Abstract
Scorio.jl is a Julia package for evaluating and ranking systems from repeated responses to shared tasks. It provides a common tensor-based interface for direct score-based, pairwise, psychometric, voting, graph, and listwise methods, so the same benchmark can be analyzed under multiple ranking assumptions. We describe the package design, position it relative to existing Julia tools, and report pilot experiments on synthetic rank recovery, stability under limited trials, and runtime scaling.
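To make the tensor-based interface concrete, the following sketch illustrates the kind of data layout and cross-paradigm comparison the abstract describes. It does not use Scorio.jl's actual API (which is not shown here); the tensor shape (systems × tasks × trials) and the two toy ranking rules are illustrative assumptions, implemented with only the Julia standard library.

```julia
using Statistics

# Hypothetical response tensor: systems × tasks × repeated trials.
# In practice these would be benchmark scores; here we use random data.
scores = rand(4, 10, 5)  # 4 systems, 10 tasks, 5 trials each

# Paradigm 1 — direct score-based ranking: mean score per system.
means = vec(mean(scores, dims=(2, 3)))
direct_rank = sortperm(means, rev=true)

# Paradigm 2 — pairwise ranking: count task/trial cells where
# system i outscores system j, then rank by total wins.
nsys = size(scores, 1)
wins = zeros(Int, nsys)
for i in 1:nsys, j in 1:nsys
    i == j && continue
    wins[i] += count(scores[i, :, :] .> scores[j, :, :])
end
pairwise_rank = sortperm(wins, rev=true)
```

Because both rules consume the same tensor, swapping the ranking assumption requires no change to the underlying benchmark data, which is the kind of flexibility a common interface is meant to provide.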