🤖 AI Summary
In subjective annotation tasks, pairwise comparisons incur high labeling costs (O(n²)) and suffer from reliability–efficiency trade-offs. To address this, we propose a zero-shot CLIP-driven human-in-the-loop ranking framework. Our method comprises three key components: (1) hierarchical zero-shot pre-ranking using CLIP to automatically resolve easily distinguishable samples; (2) bucket-aware Elo initialization combined with uncertainty-guided active sampling to prioritize high-information comparisons; and (3) a human-in-the-loop merge-sort algorithm that dynamically integrates model predictions with human feedback. Extensive evaluation across multiple datasets demonstrates that our approach reduces human annotation effort by 90.5% compared to exhaustive pairwise comparison, and further cuts labeling cost by 19.8% relative to the state-of-the-art (n = 100), while maintaining or improving rating consistency.
📝 Abstract
Pairwise comparison is often favored over absolute rating or ordinal classification in subjective or difficult annotation tasks due to its improved reliability. However, exhaustive comparison requires a massive number of annotations (O(n²)). Recent work has greatly reduced the annotation burden (O(n log n)) by actively sampling pairwise comparisons using a sorting algorithm. We further improve annotation efficiency by (1) roughly pre-ordering items hierarchically with the Contrastive Language-Image Pre-training (CLIP) model, without any training, and (2) replacing easy, obvious human comparisons with automated ones. The proposed EZ-Sort first produces a CLIP-based zero-shot pre-ordering, then initializes bucket-aware Elo scores, and finally runs an uncertainty-guided human-in-the-loop MergeSort. Validation on face-age estimation (FGNET), historical image chronology (DHCI), and retinal image quality assessment (EyePACS) showed that EZ-Sort reduced human annotation cost by 90.5% compared to exhaustive pairwise comparison and by 19.8% compared to prior work (when n = 100), while maintaining or improving inter-rater reliability. These results demonstrate that combining CLIP-based priors with uncertainty-aware sampling yields an efficient and scalable solution for pairwise ranking.
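The bucket-aware Elo initialization and uncertainty-guided MergeSort described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the bucket indices stand in for the CLIP zero-shot pre-ordering, the `base`/`gap` Elo values and the confidence threshold `tau` are assumed for demonstration, and the human annotator is stubbed as an `oracle` callable.

```python
def bucket_elo_init(buckets, base=1000.0, gap=200.0):
    """Give every item an initial Elo score from its pre-ranking bucket.
    `base` and `gap` are illustrative values, not from the paper."""
    return {item: base + b * gap for item, b in buckets.items()}

def elo_expected(r_a, r_b):
    """Standard Elo expected score: estimated probability that A outranks B."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def compare(a, b, ratings, oracle, tau, stats):
    """Uncertainty-guided comparison: resolve automatically when the Elo
    prior is confident, otherwise fall back to the human oracle."""
    p = elo_expected(ratings[a], ratings[b])
    if p >= tau:            # prior confidently says a outranks b
        return True
    if p <= 1.0 - tau:      # prior confidently says b outranks a
        return False
    if stats is not None:   # genuinely uncertain pair -> ask a human
        stats["human_queries"] = stats.get("human_queries", 0) + 1
    return oracle(a, b)

def hil_merge_sort(items, ratings, oracle, tau=0.75, stats=None):
    """MergeSort whose comparator mixes automated and human comparisons.
    Returns the items ordered best-first."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = hil_merge_sort(items[:mid], ratings, oracle, tau, stats)
    right = hil_merge_sort(items[mid:], ratings, oracle, tau, stats)
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if compare(left[i], right[j], ratings, oracle, tau, stats):
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged.extend(left[i:]); merged.extend(right[j:])
    return merged
```

With these assumed constants, a 200-point gap gives cross-bucket pairs an expected score of about 0.76, so they are resolved automatically at `tau=0.75`, while same-bucket ties (expected score 0.5) are routed to the human oracle; this is what cuts the human comparison count below the O(n log n) of a fully human MergeSort.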