🤖 AI Summary
High-quality, AI-ready data are critically needed for biological image analysis; however, existing repositories—such as the Image Data Resource (IDR) and BioImage Archive—offer rich metadata but lack machine learning–oriented standardization, resulting in time-consuming, error-prone preprocessing. To address this, we propose the first AI-training-optimized framework for standardized biological image dataset publication. Implemented as a Python workflow, it integrates OME-NGFF parsing, Zarr format conversion, seamless interfacing with the Hugging Face Datasets API, and automated metadata enrichment. This enables end-to-end, one-click conversion and unified packaging of images, annotations, and metadata into interoperable, training-ready formats. The framework drastically reduces data preparation time and supports direct loading into deep learning pipelines. Its efficacy is empirically validated across multiple open-source biological image benchmarks, demonstrating robust compatibility, scalability, and reproducibility.
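To make the packaging step concrete, here is a minimal sketch of how a parsed OME-NGFF group might be flattened into training-ready records. The nested-dict layout and the `to_training_records` helper are illustrative assumptions, not BioimageAIpub's actual API; a real pipeline would read the pyramid with the `zarr`/`ome-zarr` libraries and hand the records to `datasets.Dataset.from_list`.

```python
import numpy as np

# Toy stand-in for a parsed OME-NGFF (OME-Zarr) group: a multiscale
# image pyramid plus the axis metadata that OME-NGFF stores alongside it.
parsed_group = {
    "multiscales": {"axes": ["z", "y", "x"], "levels": {"0": np.zeros((4, 32, 32))}},
    "labels": np.ones((4, 32, 32), dtype=np.uint8),  # per-pixel annotations
}

def to_training_records(group):
    """Flatten the full-resolution level and its annotations into
    per-plane records suitable for a machine-learning dataset builder."""
    volume = group["multiscales"]["levels"]["0"]
    labels = group["labels"]
    axes = "".join(group["multiscales"]["axes"])
    return [
        {"image": img, "mask": msk, "plane": i, "axes": axes}
        for i, (img, msk) in enumerate(zip(volume, labels))
    ]

records = to_training_records(parsed_group)
print(len(records))        # one record per z-plane: 4
print(sorted(records[0]))  # ['axes', 'image', 'mask', 'plane']
```

The key design point is that image, annotation, and axis metadata travel together in every record, so downstream training code never needs to re-align them.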
📝 Abstract
Modern bioimage analysis approaches are data-hungry, making it necessary for researchers to scavenge data beyond those collected within their (bio)imaging facilities. In addition to scale, bioimaging datasets must be accompanied by suitable, high-quality annotations and metadata. Although established data repositories such as the Image Data Resource (IDR) and BioImage Archive offer rich metadata, their contents typically cannot be directly consumed by image analysis tools without substantial data wrangling. Such tedious (meta)data assembly and conversion can consume a substantial share of researchers' time, hindering the development of more powerful analysis tools. Here, we introduce BioimageAIpub, a workflow that streamlines bioimaging data conversion, enabling a seamless upload to HuggingFace, a widely used platform for sharing machine learning datasets and models.
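One part of such an upload workflow is generating the dataset card (a README.md with YAML front matter) that the Hugging Face Hub uses to index datasets. The sketch below shows that step using only the standard library; the `make_dataset_card` helper and the metadata field names are hypothetical and not BioimageAIpub's actual schema.

```python
# Sketch of a metadata-enrichment step: rendering repository-level
# metadata into a Hugging Face dataset card. Field names are illustrative.

def make_dataset_card(meta):
    """Render a minimal README.md with YAML front matter, the format
    the Hugging Face Hub parses for dataset discovery."""
    tags = "\n".join(f"- {t}" for t in meta["tags"])
    return (
        "---\n"
        f"license: {meta['license']}\n"
        "tags:\n"
        f"{tags}\n"
        "---\n\n"
        f"# {meta['name']}\n\n{meta['description']}\n"
    )

card = make_dataset_card({
    "name": "idr-example-dataset",
    "license": "cc-by-4.0",
    "tags": ["bioimaging", "ome-ngff"],
    "description": "Converted to a training-ready layout for ML pipelines.",
})
print(card)
```

Automating this rendering keeps the published card consistent with the source repository's metadata, rather than relying on hand-written descriptions.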