🤖 AI Summary
Expert annotation of chest X-ray images is costly and prone to diagnostic bias. Method: This paper proposes a non-expert crowdsourcing paradigm for rapid annotation of anatomical and device-level visual features (e.g., chest tubes), rather than diagnostic labels, and introduces NEATX, a new benchmark dataset. It designs a pathology-consistency evaluation framework using YOLO and RetinaNet for object detection, and quantifies inter-annotator reliability with Cohen's and Fleiss' kappa. The framework yields 4.5k newly annotated tube instances across NIH-CXR14 and PadChest. Contribution/Results: Detectors trained solely on non-expert annotations generalize robustly to expert-labeled data; inter-annotator agreement reaches moderate to almost-perfect levels (κ = 0.40–0.89). This work establishes a reproducible, low-bias, and cost-effective methodology for medical image data curation, accompanied by an empirically validated benchmark.
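The inter-annotator reliability numbers above come from kappa statistics, which correct raw agreement for agreement expected by chance. A minimal sketch of Cohen's kappa for two annotators (the toy labels below are illustrative, not from the dataset):

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Observed agreement: fraction of items both annotators labeled identically.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    ca, cb = Counter(a), Counter(b)
    p_e = sum((ca[l] / n) * (cb[l] / n) for l in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

# Toy example: two annotators marking chest-drain presence (1) / absence (0).
ann1 = [1, 1, 0, 1, 0, 0, 1, 0]
ann2 = [1, 1, 0, 0, 0, 1, 1, 0]
print(round(cohen_kappa(ann1, ann2), 2))  # 0.5
```

Fleiss' kappa generalizes the same idea to more than two annotators by averaging per-item agreement over all annotator pairs.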
📝 Abstract
The advancement of machine learning algorithms in medical image analysis requires the expansion of training datasets. A popular and cost-effective approach is automated annotation extraction from free-text medical reports, primarily due to the high costs associated with expert clinicians annotating medical images, such as chest X-rays. However, it has been shown that the resulting datasets are susceptible to biases and shortcuts. Another strategy to increase the size of a dataset is crowdsourcing, a widely adopted practice in general computer vision with some success in medical image analysis. In a similar vein to crowdsourcing, we enhance two publicly available chest X-ray datasets by incorporating non-expert annotations. However, instead of using diagnostic labels, we annotate shortcuts in the form of tubes. We collect 3.5k chest drain annotations for NIH-CXR14, and 1k annotations for four different tube types in PadChest, and create the Non-Expert Annotations of Tubes in X-rays (NEATX) dataset. We train a chest drain detector with the non-expert annotations that generalizes well to expert labels. Moreover, we compare our annotations to those provided by experts and show "moderate" to "almost perfect" agreement. Finally, we present a pathology agreement study to raise awareness about the quality of ground truth annotations. We make our dataset available at https://zenodo.org/records/14944064 and our code available at https://github.com/purrlab/chestxr-label-reliability.
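The qualitative labels "moderate" and "almost perfect" are the conventional Landis & Koch (1977) bands for interpreting kappa; assuming that is the scale in use here, the mapping can be sketched as:

```python
def landis_koch(kappa):
    """Map a kappa value to the Landis & Koch (1977) qualitative band."""
    if kappa < 0:
        return "poor"
    bands = [(0.81, "almost perfect"), (0.61, "substantial"),
             (0.41, "moderate"), (0.21, "fair"), (0.0, "slight")]
    for lower, name in bands:
        if kappa >= lower:
            return name

print(landis_koch(0.45))  # moderate
print(landis_koch(0.89))  # almost perfect
```

The band boundaries are the standard published thresholds; the function name is our own for illustration.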