🤖 AI Summary
This work addresses the challenges of instance segmentation in plankton microscopic images, where scarce annotations, cluttered backgrounds, and overlapping organisms hinder performance. To overcome these issues, the authors propose PlankFormer, a novel framework that integrates a generative pseudo-community image synthesis strategy to augment training data and combines a Mask2Former decoder with a Vision Transformer pretrained via Masked Autoencoders (MAE) in a self-supervised manner. This design substantially reduces reliance on pixel-level annotations. Experimental results on real-world datasets demonstrate that PlankFormer significantly outperforms baseline methods such as Mask R-CNN, particularly in high-clutter scenarios, achieving more robust and accurate instance segmentation of plankton.
📝 Abstract
Plankton monitoring is essential for assessing aquatic ecosystems but is limited by the labor-intensive nature of manual microscopic analysis. Automating the segmentation of plankton from crowded images is therefore crucial; however, it faces two major challenges: (i) the scarcity of pixel-level annotated datasets and (ii) the difficulty of distinguishing plankton from debris and overlapping individuals using conventional CNN-based methods. To address these issues, we propose PlankFormer, a novel framework for plankton instance segmentation. First, to overcome the data shortage, we introduce a method to generate labeled Pseudo Community Images (PCI) by compositing individual plankton images onto diverse backgrounds, including those created by generative models. Second, we propose a segmentation model that uses a Vision Transformer (ViT) backbone with a Mask2Former decoder. To robustly capture the global structural features of plankton under occlusion and debris, we employ a Masked Autoencoder (MAE) for self-supervised pre-training on unlabeled individual images. Experimental results on real-world datasets demonstrate that our method significantly outperforms conventional methods, such as Mask R-CNN, particularly in challenging environments with high debris density. We demonstrate that our synthetic training strategy and MAE-based architecture enable high-precision segmentation while requiring fewer manual annotations for individual plankton images.
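The Pseudo Community Image idea can be illustrated with a minimal compositing sketch. The abstract does not describe the actual pipeline, so everything below is an assumption made for illustration: the function name `compose_pseudo_community`, uniform random placement, and hard pasting without blending or scaling. The point it shows is that pasting individual crops with known foreground masks yields an instance label map for free, and that later pastes overwriting earlier ones give a simple form of overlap.

```python
import numpy as np

def compose_pseudo_community(background, crops, masks, rng):
    """Paste individual crops onto a background (hypothetical helper).

    background: (H, W, 3) uint8 array.
    crops:      list of (h, w, 3) uint8 arrays, one per individual.
    masks:      list of (h, w) bool arrays marking each crop's foreground.
    rng:        numpy.random.Generator used for placement.
    Returns the composite image and an (H, W) int32 instance-ID map
    in which 0 means background and k means the k-th pasted individual.
    """
    image = background.copy()
    instance_map = np.zeros(background.shape[:2], dtype=np.int32)
    bg_h, bg_w = background.shape[:2]

    for instance_id, (crop, mask) in enumerate(zip(crops, masks), start=1):
        h, w = mask.shape
        # Assumed placement rule: uniform random position that fits inside the frame.
        top = int(rng.integers(0, bg_h - h + 1))
        left = int(rng.integers(0, bg_w - w + 1))
        region = (slice(top, top + h), slice(left, left + w))
        # Later individuals overwrite earlier ones, so the label map
        # records only the visible pixels of each instance.
        image[region][mask] = crop[mask]
        instance_map[region][mask] = instance_id

    return image, instance_map
```

Each nonzero ID in `instance_map` can then be converted to a binary mask per individual, which is the form of label an instance segmentation model is trained on. A background produced by a generative model would simply be passed in as `background`.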