PlankFormer: Robust Plankton Instance Segmentation via MAE-Pretrained Vision Transformers and Pseudo Community Image Generation

📅 2026-04-20

🤖 AI Summary
This work addresses the challenges of instance segmentation in plankton microscopic images, where scarce annotations, cluttered backgrounds, and overlapping organisms hinder performance. To overcome these issues, the authors propose PlankFormer, a novel framework that integrates a generative pseudo-community image synthesis strategy to augment training data and combines a Mask2Former decoder with a Vision Transformer pretrained via Masked Autoencoders (MAE) in a self-supervised manner. This design substantially reduces reliance on pixel-level annotations. Experimental results on real-world datasets demonstrate that PlankFormer significantly outperforms baseline methods such as Mask R-CNN, particularly in high-clutter scenarios, achieving more robust and accurate instance segmentation of plankton.
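The MAE pre-training mentioned above hides a large share of image patches and trains the network to reconstruct them from the few that remain visible. The following is a minimal NumPy sketch of that random patch-masking step only, not the authors' implementation. The function name `random_patch_mask`, the patch size, and the 0.75 mask ratio (the default in the original MAE paper) are assumptions; this page does not state the values PlankFormer uses.

```python
import numpy as np


def random_patch_mask(image, patch_size, mask_ratio, rng):
    """Split an image into square patches and hide a random subset.

    Hypothetical helper for illustration. Returns the flattened visible
    patches and a boolean vector where True marks a masked patch.
    """
    height, width = image.shape[:2]
    grid_h, grid_w = height // patch_size, width // patch_size
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))

    # Pick which patches to hide by shuffling the patch indices.
    order = rng.permutation(num_patches)
    masked = np.zeros(num_patches, dtype=bool)
    masked[order[:num_masked]] = True

    # Patchify: (H, W, C) -> (num_patches, patch_size * patch_size * C).
    cropped = image[: grid_h * patch_size, : grid_w * patch_size]
    patches = (
        cropped.reshape(grid_h, patch_size, grid_w, patch_size, -1)
        .transpose(0, 2, 1, 3, 4)
        .reshape(num_patches, -1)
    )

    # Only the visible patches would be fed to the encoder.
    return patches[~masked], masked


rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
visible, masked = random_patch_mask(image, patch_size=8, mask_ratio=0.75, rng=rng)
print(visible.shape, int(masked.sum()))
```

In a full MAE setup the encoder would see only `visible`, and a lightweight decoder would be trained to reconstruct the pixels of the patches flagged in `masked`.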

📝 Abstract
Plankton monitoring is essential for assessing aquatic ecosystems but is limited by the labor-intensive nature of manual microscopic analysis. Automating the segmentation of plankton from crowded images is crucial; however, it faces two major challenges: (i) the scarcity of pixel-level annotated datasets and (ii) the difficulty of distinguishing plankton from debris and overlapping individuals using conventional CNN-based methods. To address these issues, we propose PlankFormer, a novel framework for plankton instance segmentation. First, to overcome the data shortage, we introduce a method to generate labeled Pseudo Community Images (PCI) by compositing individual plankton images onto diverse backgrounds, including those created by generative models. Second, we propose a segmentation model that uses a Vision Transformer (ViT) backbone with a Mask2Former decoder. To robustly capture the global structural features of plankton against occlusion and debris, we employ a Masked Autoencoder (MAE) for self-supervised pre-training on unlabeled individual images. Experimental results on real-world datasets demonstrate that our method significantly outperforms conventional methods, such as Mask R-CNN, particularly in challenging environments with high debris density. We demonstrate that our synthetic training strategy and MAE-based architecture enable high-precision segmentation while requiring fewer manual annotations for individual plankton images.
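The Pseudo Community Image idea in the abstract, pasting individual plankton images onto a background so that instance labels come for free, can be sketched roughly as follows. This is an illustrative NumPy sketch under simple assumptions, not the authors' pipeline: the function name `compose_pseudo_community`, the uniform random placement, and the hard (non-blended) paste are all hypothetical choices, and each crop is assumed to come with a boolean foreground mask.

```python
import numpy as np


def compose_pseudo_community(background, crops, masks, rng):
    """Paste individual crops onto a background image.

    Hypothetical helper for illustration. Returns the composite image and
    an integer label map (0 = background, k = k-th pasted instance).
    Later crops overwrite earlier ones, which imitates occlusion.
    """
    image = background.copy()
    labels = np.zeros(background.shape[:2], dtype=np.int32)
    height, width = labels.shape

    for instance_id, (crop, mask) in enumerate(zip(crops, masks), start=1):
        h, w = mask.shape
        # Random top-left corner that keeps the crop inside the canvas.
        y = int(rng.integers(0, height - h + 1))
        x = int(rng.integers(0, width - w + 1))
        region = (slice(y, y + h), slice(x, x + w))

        # Copy only foreground pixels and record which instance owns them.
        image[region][mask] = crop[mask]
        labels[region][mask] = instance_id

    return image, labels


rng = np.random.default_rng(0)
background = np.zeros((64, 64, 3), dtype=np.uint8)
crops = [np.full((16, 16, 3), v, dtype=np.uint8) for v in (100, 200)]
masks = [np.ones((16, 16), dtype=bool) for _ in crops]
image, labels = compose_pseudo_community(background, crops, masks, rng)
print(image.shape, np.unique(labels))
```

Because the label map is written at the same time as the pixels, every synthetic image comes with a pixel-accurate instance annotation, which is the property that reduces the need for manual labeling.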
Problem

Research questions and friction points this paper is trying to address.

plankton instance segmentation
pixel-level annotation scarcity
occlusion and debris
crowded microscopic images
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pseudo Community Image
Masked Autoencoder
Vision Transformer
Instance Segmentation
Self-supervised Pre-training
Authors
Masaharu Miyazaki
Graduate School of Information Sciences, Tohoku University, 6-6-05, Aramaki Aza Aoba, Sendai, 9808579, Japan.
Yurie Otake
The Center for Ecological Research, Kyoto University, 2–509–3, Hirano, Otsu-shi, Shiga-ken, 5202113, Japan.
Koichi Ito
Associate Professor, Graduate School of Information Sciences, Tohoku University
Image Processing, Computer Vision, Biometrics
Wataru Makino
Graduate School of Life Sciences, Tohoku University, 6–3, Aramaki Aza Aoba, Aoba-ku, Sendai-shi, 9808578, Japan.
Jotaro Urabe
Graduate School of Life Sciences, Tohoku University, 6–3, Aramaki Aza Aoba, Aoba-ku, Sendai-shi, 9808578, Japan.
Takafumi Aoki
Graduate School of Information Sciences, Tohoku University, 6-6-05, Aramaki Aza Aoba, Sendai, 9808579, Japan.