🤖 AI Summary
In educational assessment, comparative judgment (CJ) excels at holistic ranking but struggles to support criterion-based, multidimensional competency breakdowns and fine-grained feedback, revealing a methodological gap between holistic and standards-aligned evaluation. To bridge this gap, we propose the Multi-Criteria Bayesian Comparative Judgment (MC-BCJ) framework, the first to extend Bayesian preference modeling to jointly infer multiple independent learning outcomes, so that holistic and criterion-specific ordinal rankings are produced simultaneously with quantified predictive uncertainty. MC-BCJ integrates entropy-driven active learning, multi-output ordinal regression, and interpretable rater-consistency inference. Evaluated on synthetic and real-world educational datasets, MC-BCJ achieves significantly higher annotation efficiency and better per-criterion prediction accuracy, and it delivers holistic and criterion-level rankings with calibrated confidence intervals alongside explicit consistency metrics, closing the gap between holistic judgment and standards-oriented assessment.
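As a rough sketch of the ingredients named above (our notation, not taken from the paper): a BCJ-style model places a Beta prior on the probability that item $i$ beats item $j$ under criterion $c$ and updates it with the observed judgments,

$$\pi_{ij}^{(c)} \sim \mathrm{Beta}(\alpha_0, \beta_0), \qquad \pi_{ij}^{(c)} \mid \mathcal{D} \sim \mathrm{Beta}\!\left(\alpha_0 + w_{ij}^{(c)},\; \beta_0 + w_{ji}^{(c)}\right),$$

where $w_{ij}^{(c)}$ counts the judgments preferring $i$ over $j$ on criterion $c$. Writing $p = \mathbb{E}\big[\pi_{ij}^{(c)} \mid \mathcal{D}\big]$, the uncertainty about the next judgment on that pair and criterion is the Bernoulli entropy $H_{ij}^{(c)} = -p \log p - (1-p)\log(1-p)$, and one plausible multi-criteria acquisition rule, in the spirit of the aggregation mentioned above, is to query the pair $\arg\max_{(i,j)} \sum_{c} H_{ij}^{(c)}$; the paper's exact aggregation may differ.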
📝 Abstract
Comparative Judgement (CJ) provides an alternative assessment approach by evaluating work holistically rather than breaking it into discrete criteria. This method leverages assessors' ability to make nuanced comparisons, yielding more reliable and valid assessments. CJ also aligns with real-world evaluation, where overall quality emerges from the interplay of many elements. However, rubrics remain widely used in education, offering structured criteria for grading and detailed feedback, which creates a gap between CJ's holistic ranking and the need for criterion-based performance breakdowns. This paper addresses that gap with a Bayesian approach. We build on Bayesian CJ (BCJ) by Gray et al., which models pairwise preferences directly instead of placing likelihoods over latent total scores, yielding expected ranks with uncertainty estimates; their entropy-based active learning method selects the most informative pairwise comparisons to present to assessors. We extend BCJ to handle multiple independent learning outcome (LO) components, defined by a rubric, enabling both holistic and component-wise predictive rankings with uncertainty estimates. We also propose a method to aggregate the per-component entropies and identify the single most informative comparison to pose next. Experiments on synthetic and real data demonstrate our method's effectiveness. Finally, we address a key limitation of BCJ: its inability to quantify assessor agreement. We show how agreement levels can be derived from the model, enhancing transparency in assessment.
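To make the moving parts concrete, the following is a minimal, self-contained sketch in the spirit of the extension described above (an illustration under our own assumptions, not the authors' implementation; the class and method names are hypothetical). It keeps one Beta-Bernoulli preference model per pair and per LO component, selects the next comparison by entropy summed across components, and estimates expected ranks with uncertainty by Monte Carlo:

```python
import numpy as np

def bernoulli_entropy(p):
    """Shannon entropy of a Bernoulli(p) outcome, elementwise."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

class MultiCriteriaBCJ:
    """Toy multi-criteria BCJ: one Beta-Bernoulli model per pair per criterion."""

    def __init__(self, n_items, n_criteria, alpha0=1.0, beta0=1.0):
        self.n = n_items
        # alpha[c, i, j] / beta[c, i, j]: pseudo-counts for "i beats j" on criterion c.
        self.alpha = np.full((n_criteria, n_items, n_items), alpha0)
        self.beta = np.full((n_criteria, n_items, n_items), beta0)

    def update(self, i, j, wins):
        """Record one judgment; wins[c] is True if i was preferred on criterion c."""
        for c, i_won in enumerate(wins):
            a, b = (i, j) if i_won else (j, i)
            self.alpha[c, a, b] += 1
            self.beta[c, b, a] += 1

    def next_pair(self):
        """Select the pair whose outcome is most uncertain, summed over criteria."""
        p = self.alpha / (self.alpha + self.beta)      # posterior mean preferences
        H = bernoulli_entropy(p).sum(axis=0)           # aggregate entropy per pair
        iu = np.triu_indices(self.n, k=1)              # consider each pair once
        best = np.argmax(H[iu])
        return int(iu[0][best]), int(iu[1][best])

    def expected_ranks(self, criterion, n_samples=2000, seed=0):
        """Monte Carlo mean and std of each item's rank under one criterion."""
        rng = np.random.default_rng(seed)
        pi = rng.beta(self.alpha[criterion], self.beta[criterion],
                      size=(n_samples, self.n, self.n))
        pi *= 1.0 - np.eye(self.n)                     # ignore self-comparisons
        scores = pi.sum(axis=2)                        # sampled win strength per item
        order = np.argsort(-scores, axis=1)            # items from best to worst
        ranks = np.empty_like(order)
        rows = np.arange(n_samples)[:, None]
        ranks[rows, order] = np.arange(1, self.n + 1)  # rank 1 = best
        return ranks.mean(axis=0), ranks.std(axis=0)

# Hypothetical usage: three LO components, assessor prefers item i on two of them.
model = MultiCriteriaBCJ(n_items=5, n_criteria=3)
i, j = model.next_pair()
model.update(i, j, wins=[True, True, False])
mean_rank, rank_std = model.expected_ranks(criterion=0)
```

A per-assessor agreement measure in the same spirit could, for instance, compare each assessor's recorded choices against the posterior-mean preferences, though the paper derives its own agreement quantities.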