Dodgersort: Uncertainty-Aware VLM-Guided Human-in-the-Loop Pairwise Ranking

📅 2026-03-21

🤖 AI Summary

This work addresses the quadratic cost of exhaustive pairwise comparison labeling by proposing an active ranking framework that integrates vision-language models with human feedback. The approach combines CLIP-based hierarchical pre-ordering, a neural ranking head, a probabilistic ensemble of Elo, Bradley–Terry–Luce (BTL), and Gaussian Process (GP) models, decomposition of epistemic and aleatoric uncertainty, and information-theoretic pair selection. Evaluated on visual ranking tasks in medical imaging, historical dating, and aesthetics, the method reduces annotation effort by 11–16% while improving inter-rater reliability. Cross-domain ablations across four datasets attribute the gain mainly to neural adaptation and ensemble uncertainty. On FG-NET, which has ground-truth ages, it extracts 5–20× more ranking information per comparison than baselines, yielding Pareto-optimal accuracy-efficiency trade-offs.

📝 Abstract

Pairwise comparison labeling is gaining traction because it yields higher inter-rater reliability than conventional classification labeling, but exhaustive comparison has quadratic cost. We propose Dodgersort, which leverages CLIP-based hierarchical pre-ordering, a neural ranking head and probabilistic ensemble (Elo, BTL, GP), epistemic-aleatoric uncertainty decomposition, and information-theoretic pair selection. It reduces the number of human comparisons while improving the reliability of the resulting rankings. On visual ranking tasks in medical imaging, historical dating, and aesthetics, Dodgersort achieves an 11–16% annotation reduction while improving inter-rater reliability. Cross-domain ablations across four datasets show that neural adaptation and ensemble uncertainty are key to this gain. On FG-NET, which has ground-truth ages, the framework extracts 5–20× more ranking information per comparison than baselines, yielding Pareto-optimal accuracy-efficiency trade-offs.
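
To make the idea of epistemic-aleatoric decomposition and uncertainty-driven pair selection more concrete, here is a minimal sketch. It is not the paper's implementation: it assumes each ensemble member (standing in for Elo, BTL, and GP) outputs a probability that item i ranks above item j, and it uses the common entropy-based split in which total uncertainty is the entropy of the averaged prediction, the aleatoric part is the average of the members' entropies, and the epistemic part is the difference. The function names, the candidate pairs, and the probabilities are all hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) outcome."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def decompose_uncertainty(member_probs):
    """Split predictive uncertainty for one candidate pair (i, j).

    member_probs: P(i ranks above j) from each ensemble member
    (illustrative stand-ins for Elo, BTL, and GP outputs).
    Returns (total, aleatoric, epistemic) in bits.
    """
    mean_p = sum(member_probs) / len(member_probs)
    total = binary_entropy(mean_p)
    aleatoric = sum(binary_entropy(p) for p in member_probs) / len(member_probs)
    # Clamp tiny negative values caused by floating-point rounding.
    epistemic = max(total - aleatoric, 0.0)
    return total, aleatoric, epistemic


def select_pair(candidates):
    """Return the pair with the largest epistemic term (most member disagreement)."""
    return max(candidates, key=lambda pair: decompose_uncertainty(candidates[pair])[2])


# Hypothetical candidate pairs with made-up member probabilities.
candidates = {
    ("a", "b"): [0.5, 0.5, 0.5],     # members agree it is a coin flip
    ("a", "c"): [0.1, 0.5, 0.9],     # members disagree strongly
    ("b", "c"): [0.95, 0.9, 0.97],   # members agree and are confident
}

if __name__ == "__main__":
    for pair, probs in candidates.items():
        total, aleatoric, epistemic = decompose_uncertainty(probs)
        print(pair, f"total={total:.3f} aleatoric={aleatoric:.3f} epistemic={epistemic:.3f}")
    print("query next:", select_pair(candidates))
```

Under these assumptions, a pair on which all members predict 0.5 carries only aleatoric uncertainty, so asking a human about it is unlikely to resolve model disagreement. The pair on which members disagree has the largest epistemic term and is the one selected for annotation.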

Problem

Research questions and friction points this paper is trying to address.

pairwise comparison

annotation efficiency

ranking reliability

human-in-the-loop

uncertainty-aware

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uncertainty-aware ranking

Human-in-the-loop

Pairwise comparison