PlankFormer: Robust Plankton Instance Segmentation via MAE-Pretrained Vision Transformers and Pseudo Community Image Generation

📅 2026-04-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenges of instance segmentation in plankton microscopic images, where scarce annotations, cluttered backgrounds, and overlapping organisms hinder performance. To overcome these issues, the authors propose PlankFormer, a novel framework that integrates a generative pseudo-community image synthesis strategy to augment training data and combines a Mask2Former decoder with a Vision Transformer pretrained via Masked Autoencoders (MAE) in a self-supervised manner. This design substantially reduces reliance on pixel-level annotations. Experimental results on real-world datasets demonstrate that PlankFormer significantly outperforms baseline methods such as Mask R-CNN, particularly in high-clutter scenarios, achieving more robust and accurate instance segmentation of plankton.
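The MAE pre-training mentioned above relies on hiding most image patches and reconstructing them from the few that remain visible. The following is a minimal NumPy sketch of that random patch-masking step only, not the authors' implementation. The function name `random_patch_mask`, the grayscale input, and the 75% mask ratio (the default in the original MAE paper, not stated for PlankFormer here) are assumptions made for illustration.

```python
import numpy as np

def random_patch_mask(image, patch_size, mask_ratio, rng):
    """Split a 2-D image into square patches and hide a random subset.

    Hypothetical helper for illustration. Returns the visible patches
    (flattened), their patch indices, and a boolean vector where True
    marks a masked (hidden) patch.
    """
    height, width = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    grid_h, grid_w = height // patch_size, width // patch_size
    num_patches = grid_h * grid_w

    # (grid_h, p, grid_w, p) -> (grid_h, grid_w, p, p) -> (num_patches, p*p)
    patches = (
        image.reshape(grid_h, patch_size, grid_w, patch_size)
        .transpose(0, 2, 1, 3)
        .reshape(num_patches, patch_size * patch_size)
    )

    # Keep a random subset of patches; everything else is masked.
    num_keep = int(round(num_patches * (1.0 - mask_ratio)))
    keep_idx = np.sort(rng.permutation(num_patches)[:num_keep])
    masked = np.ones(num_patches, dtype=bool)
    masked[keep_idx] = False
    return patches[keep_idx], keep_idx, masked


# Usage: a 16x16 image with 4x4 patches gives 16 patches, 4 of them visible.
rng = np.random.default_rng(0)
image = np.arange(16 * 16, dtype=np.float32).reshape(16, 16)
visible, keep_idx, masked = random_patch_mask(image, 4, 0.75, rng)
```

In an MAE-style setup, only the visible patches would be passed to the ViT encoder, and a lightweight decoder would be trained to reconstruct the masked ones. That part is omitted here.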

📝 Abstract

Plankton monitoring is essential for assessing aquatic ecosystems but is limited by the labor-intensive nature of manual microscopic analysis. Automating the segmentation of plankton from crowded images is crucial; however, it faces two major challenges: (i) the scarcity of pixel-level annotated datasets and (ii) the difficulty of distinguishing plankton from debris and overlapping individuals using conventional CNN-based methods. To address these issues, we propose PlankFormer, a novel framework for plankton instance segmentation. First, to overcome the data shortage, we introduce a method to generate labeled Pseudo Community Images (PCI) by synthesizing individual plankton images onto diverse backgrounds, including those created by generative models. Second, we propose a segmentation model utilizing a Vision Transformer (ViT) backbone with a Mask2Former decoder. To robustly capture the global structural features of plankton against occlusion and debris, we employ a Masked Autoencoder (MAE) for self-supervised pre-training on unlabeled individual images. Experimental results on real-world datasets demonstrate that our method significantly outperforms conventional methods, such as Mask R-CNN, particularly in challenging environments with high debris density. We demonstrate that our synthetic training strategy and MAE-based architecture enable high-precision segmentation while requiring fewer manual annotations for individual plankton images.
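The Pseudo Community Image idea described in the abstract, pasting individual plankton crops onto a background so that instance labels come for free, can be illustrated with a small NumPy sketch. This is a simplified illustration under stated assumptions, not the authors' pipeline: the function name `compose_pseudo_community`, grayscale arrays, uniform random placement, and hard (non-blended) pasting are all hypothetical choices, and the generative-model backgrounds are replaced by a plain array.

```python
import numpy as np

def compose_pseudo_community(background, crops, masks, rng):
    """Paste individual crops onto a background and build an instance-ID map.

    Hypothetical helper for illustration. `background` is an (H, W) array;
    `crops` and `masks` are lists of (h, w) arrays, with each mask boolean
    (True = organism pixel). Later crops occlude earlier ones, which mimics
    overlapping individuals.
    """
    image = background.copy()
    instance_map = np.zeros(background.shape, dtype=np.int32)  # 0 = background
    height, width = background.shape

    for instance_id, (crop, mask) in enumerate(zip(crops, masks), start=1):
        h, w = crop.shape
        # Random top-left corner chosen so the crop fits inside the image.
        top = int(rng.integers(0, height - h + 1))
        left = int(rng.integers(0, width - w + 1))
        region = (slice(top, top + h), slice(left, left + w))

        # Copy only organism pixels; record which instance owns each pixel.
        image[region] = np.where(mask, crop, image[region])
        instance_map[region] = np.where(mask, instance_id, instance_map[region])

    return image, instance_map


# Usage: two synthetic "organisms" pasted onto an empty 32x32 background.
rng = np.random.default_rng(0)
background = np.zeros((32, 32), dtype=np.float32)
crops = [np.ones((8, 8), dtype=np.float32), np.full((6, 6), 2.0, dtype=np.float32)]
masks = [np.ones((8, 8), dtype=bool), np.ones((6, 6), dtype=bool)]
pci_image, pci_labels = compose_pseudo_community(background, crops, masks, rng)
```

The per-pixel instance map can be converted to one binary mask per organism (`pci_labels == k`), which is the form of supervision an instance segmentation model typically expects.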

Problem

Research questions and friction points this paper is trying to address.

plankton instance segmentation

pixel-level annotation scarcity

occlusion and debris

crowded microscopic images

Innovation

Methods, ideas, or system contributions that make the work stand out.

Pseudo Community Image

Masked Autoencoder

Vision Transformer

Instance Segmentation

Self-supervised Pre-training

🔎 Similar Papers

GMT: Guided Mask Transformer for Leaf Instance Segmentation

2024-06-24 · arXiv.org · Citations: 0
