🤖 AI Summary
This work addresses the scarcity of medical imaging datasets that both capture inter-annotator disagreement and provide an independent gold standard, a gap that hinders objective evaluation of model robustness. To this end, the authors introduce CytoCrowd, a benchmark dataset comprising 446 high-resolution cytology images, each annotated independently by four pathologists and accompanied by an independent gold standard established by a senior expert. CytoCrowd is presented as the first cytology image dataset to offer both multiple raw expert annotations and a separate reference standard, enabling joint evaluation of standard vision tasks (such as object detection and classification) and annotation aggregation algorithms. By releasing the dataset along with baseline results, this study establishes a realistic and quantifiable benchmark for investigating annotation inconsistency and evaluating fusion strategies, thereby supporting the development of robust medical image analysis models.
📝 Abstract
High-quality annotated datasets are crucial for advancing machine learning in medical image analysis. However, a critical gap exists: most datasets either offer a single, clean ground truth, which hides real-world expert disagreement, or provide multiple annotations without a separate gold standard for objective evaluation. To bridge this gap, we introduce CytoCrowd, a new public benchmark for cytology analysis. The dataset features 446 high-resolution images, each with two key components: (1) raw, conflicting annotations from four independent pathologists, and (2) a separate, high-quality gold-standard ground truth established by a senior expert. This dual structure makes CytoCrowd a versatile resource. It serves as a benchmark for standard computer vision tasks, such as object detection and classification, using the gold-standard ground truth. At the same time, it provides a realistic testbed for evaluating annotation aggregation algorithms that must resolve expert disagreements. We provide comprehensive baseline results for both tasks. Our experiments demonstrate the challenges presented by CytoCrowd and establish its value as a resource for developing the next generation of models for medical image analysis.
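To make the aggregation setting concrete, the following is a minimal sketch of one simple fusion strategy, majority voting, scored against a gold standard. It is not the authors' baseline, and it does not use the CytoCrowd file format. The class names (`benign`, `malignant`), the object identifiers, and the helper functions are all hypothetical, and the sketch assumes that annotations from the four pathologists have already been matched to the same objects (for detection, this matching step would itself require something like box overlap thresholds).

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label; ties go to the alphabetically first label."""
    counts = Counter(labels)
    top = max(counts.values())
    return min(label for label, n in counts.items() if n == top)


def aggregate(annotations):
    """annotations: dict mapping object id -> list of labels, one per annotator."""
    return {obj: majority_vote(labels) for obj, labels in annotations.items()}


def accuracy(predicted, gold):
    """Fraction of gold-standard objects whose aggregated label matches."""
    hits = sum(predicted.get(obj) == label for obj, label in gold.items())
    return hits / len(gold)


# Toy data invented for illustration only (not taken from CytoCrowd).
annotations = {
    "cell_1": ["benign", "benign", "malignant", "benign"],
    "cell_2": ["malignant", "benign", "malignant", "benign"],  # 2-2 tie
    "cell_3": ["malignant", "malignant", "malignant", "benign"],
}
gold = {"cell_1": "benign", "cell_2": "malignant", "cell_3": "malignant"}

fused = aggregate(annotations)
print(fused)
print(accuracy(fused, gold))
```

With four annotators, 2-2 ties are possible, so any voting scheme needs an explicit tie rule. The alphabetical rule here is arbitrary and chosen only for determinism; in the toy data it resolves `cell_2` against the gold label, which illustrates why a separate reference standard is useful for comparing aggregation methods.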