🤖 AI Summary
High-quality, AI-ready data are critically needed for biological image analysis; however, existing repositories—such as the Image Data Resource (IDR) and BioImage Archive—offer rich metadata but lack machine learning–oriented standardization, resulting in time-consuming, error-prone preprocessing. To address this, we propose the first AI-training-optimized framework for standardized biological image dataset publication. Implemented as a Python workflow, it integrates OME-NGFF parsing, Zarr format conversion, seamless interfacing with the Hugging Face Datasets API, and automated metadata enrichment. This enables end-to-end, one-click conversion and unified packaging of images, annotations, and metadata into interoperable, training-ready formats. The framework drastically reduces data preparation time and supports direct loading into deep learning pipelines. Its efficacy is empirically validated across multiple open-source biological image benchmarks, demonstrating robust compatibility, scalability, and reproducibility.
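To make the packaging step concrete, here is a minimal sketch of how a parsed OME-NGFF group might be flattened into training-ready records. The nested-dict layout and the `to_training_records` helper are illustrative assumptions, not BioimageAIpub's actual API; a real pipeline would read the pyramid with the `zarr`/`ome-zarr` libraries and hand the records to `datasets.Dataset.from_list`.

```python
import numpy as np

# Toy stand-in for a parsed OME-NGFF (OME-Zarr) group: a multiscale
# image pyramid plus the axis metadata that OME-NGFF stores alongside it.
parsed_group = {
    "multiscales": {"axes": ["z", "y", "x"], "levels": {"0": np.zeros((4, 32, 32))}},
    "labels": np.ones((4, 32, 32), dtype=np.uint8),  # per-pixel annotations
}

def to_training_records(group):
    """Flatten the full-resolution level and its annotations into
    per-plane records suitable for a machine-learning dataset builder."""
    volume = group["multiscales"]["levels"]["0"]
    labels = group["labels"]
    axes = "".join(group["multiscales"]["axes"])
    return [
        {"image": img, "mask": msk, "plane": i, "axes": axes}
        for i, (img, msk) in enumerate(zip(volume, labels))
    ]

records = to_training_records(parsed_group)
print(len(records))        # one record per z-plane: 4
print(sorted(records[0]))  # ['axes', 'image', 'mask', 'plane']
```

The key design point is that image, annotation, and axis metadata travel together in every record, so downstream training code never needs to re-align them.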
📝 Abstract
Modern bioimage analysis approaches are data-hungry, making it necessary for researchers to scavenge data beyond those collected within their (bio)imaging facilities. In addition to scale, bioimaging datasets must be accompanied by suitable, high-quality annotations and metadata. Although established data repositories such as the Image Data Resource (IDR) and BioImage Archive offer rich metadata, their contents typically cannot be directly consumed by image analysis tools without substantial data wrangling. Such tedious (meta)data assembly and conversion can consume a substantial share of researchers' time, hindering the development of more powerful analysis tools. Here, we introduce BioimageAIpub, a workflow that streamlines bioimaging data conversion, enabling a seamless upload to HuggingFace, a widely used platform for sharing machine learning datasets and models.
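One part of such an upload workflow is generating the dataset card (a README.md with YAML front matter) that the Hugging Face Hub uses to index datasets. The sketch below shows that step using only the standard library; the `make_dataset_card` helper and the metadata field names are hypothetical and not BioimageAIpub's actual schema.

```python
# Sketch of a metadata-enrichment step: rendering repository-level
# metadata into a Hugging Face dataset card. Field names are illustrative.

def make_dataset_card(meta):
    """Render a minimal README.md with YAML front matter, the format
    the Hugging Face Hub parses for dataset discovery."""
    tags = "\n".join(f"- {t}" for t in meta["tags"])
    return (
        "---\n"
        f"license: {meta['license']}\n"
        "tags:\n"
        f"{tags}\n"
        "---\n\n"
        f"# {meta['name']}\n\n{meta['description']}\n"
    )

card = make_dataset_card({
    "name": "idr-example-dataset",
    "license": "cc-by-4.0",
    "tags": ["bioimaging", "ome-ngff"],
    "description": "Converted to a training-ready layout for ML pipelines.",
})
print(card)
```

Automating this rendering keeps the published card consistent with the source repository's metadata, rather than relying on hand-written descriptions.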