Dataset creation for supervised deep learning-based analysis of microscopic images - review of important considerations and recommendations

📅 2025-12-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Building datasets for supervised learning on microscopic images is resource-intensive: annotation is time-consuming, domain shift is substantial, and the risk of annotation bias is high. Method: This review examines the critical stages of dataset creation (image acquisition, selection of annotation software, and annotation creation, including multi-expert collaborative annotation) and frames annotation quality around the three "C"s (correctness, completeness, consistency). It also supplies a reusable standard operating procedure (SOP) and discusses quality control protocols, cross-center annotation validation, and open data sharing. Contribution/Results: The review offers practical, implementation-ready guidance for digital pathology. It is intended to improve dataset reliability and generalizability, to support the development of high-quality, large-scale open pathology image datasets, and to improve the reproducibility and clinical applicability of deep learning models.

📝 Abstract

Supervised deep learning (DL) has attracted great interest for the automated analysis of microscopic images, with an increasing body of literature supporting its potential. The development and validation of these DL models rely heavily on the availability of high-quality, large-scale datasets. However, creating such datasets is a complex and resource-intensive process, often hindered by challenges such as time constraints, domain variability, and risks of bias in image collection and label creation. This review provides a comprehensive guide to the critical steps in dataset creation, including: 1) image acquisition, 2) selection of annotation software, and 3) annotation creation. In addition to ensuring a sufficiently large number of images, it is crucial to address sources of image variability (domain shifts), such as those related to slide preparation and digitization, that could lead to algorithmic errors if not adequately represented in the training data. Key quality criteria for annotations are the three "C"s: correctness, completeness, and consistency. This review explores methods to enhance annotation quality through the use of advanced techniques that mitigate the limitations of single annotators. To support dataset creators, a standard operating procedure (SOP) is provided as supplemental material, outlining best practices for dataset development. Furthermore, the article underscores the importance of open datasets in driving innovation and enhancing the reproducibility of DL research. By addressing the challenges and offering practical recommendations, this review aims to advance the creation and availability of high-quality, large-scale datasets, ultimately contributing to the development of generalizable and robust DL models for pathology applications.

Problem

Research questions and friction points this paper is trying to address.

Creating high-quality datasets for supervised deep learning in microscopic image analysis.

Addressing challenges like domain shifts and annotation quality in dataset development.

Providing guidelines and SOPs to improve dataset creation for pathology applications.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reviewing dataset creation steps for microscopic image analysis.

Proposing annotation quality criteria: correctness, completeness, and consistency.

Providing an SOP for high-quality, large-scale dataset development.
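
The consistency criterion, and the idea of reducing reliance on a single annotator, can be made concrete with a small sketch. The abstract does not name specific metrics or consensus rules, so the choices here are assumptions made purely for illustration: Cohen's kappa as a pairwise agreement measure and a simple majority vote as the consensus rule. The function names and the example labels are hypothetical and do not come from the paper or its SOP.

```python
from collections import Counter


def majority_vote(labels_per_annotator):
    """Return one consensus label per item; ties yield None for later review.

    labels_per_annotator: list of equally long label lists, one per annotator.
    """
    consensus = []
    for votes in zip(*labels_per_annotator):
        ranked = Counter(votes).most_common()
        # A tie between the two most frequent labels means no clear majority.
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            consensus.append(None)
        else:
            consensus.append(ranked[0][0])
    return consensus


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items labeled identically.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labeled independently
    # according to their own label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from three annotators for four image patches.
annotator_1 = ["mitosis", "mitosis", "other", "other"]
annotator_2 = ["mitosis", "other", "other", "other"]
annotator_3 = ["mitosis", "other", "other", "mitosis"]

kappa_12 = cohen_kappa(annotator_1, annotator_2)
consensus = majority_vote([annotator_1, annotator_2, annotator_3])
```

A low kappa between annotators would flag a consistency problem worth investigating before training, and items whose vote ends in a tie (returned as `None`) are natural candidates for expert review. Real projects often use more elaborate schemes, so this should be read as a starting point only.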
