A Semi-Self-Supervised Approach for Dense-Pattern Video Object Segmentation

📅 2024-06-07
🤖 AI Summary
Video object segmentation (VOS) in agricultural videos is challenging because crop parts (e.g., wheat spikes) are dense, small, frequently occluded, and in near-constant motion. Method: This paper proposes a semi-self-supervised spatiotemporal segmentation framework tailored to dense agricultural scenes. It introduces a diffusion-based approach for dense-pattern VOS that jointly optimizes reconstruction and segmentation as a multi-task objective. By pretraining on synthetic data that mimics real camera and object motion and then training on pseudo-labeled videos, the method drastically reduces reliance on frame-wise, fine-grained annotations; a lightweight spatiotemporal feature alignment network further enhances temporal consistency. Results: Evaluated on a real-world UAV-captured wheat video dataset, the method achieves a Dice score of 0.79 and generalizes across handheld recordings, multiple field locations, and the entire growth cycle. The framework is also readily extensible to other crops, human crowd analysis, and microscopic image segmentation.

📝 Abstract
Video object segmentation (VOS) -- predicting pixel-level regions for objects within each frame of a video -- is particularly challenging in agricultural scenarios, where videos of crops include hundreds of small, dense, and occluded objects (stems, leaves, flowers, pods) that sway and move unpredictably in the wind. Supervised training is the state-of-the-art for VOS, but it requires large, pixel-accurate, human-annotated videos, which are costly to produce for videos with many densely packed objects in each frame. To address these challenges, we propose a semi-self-supervised spatiotemporal approach for dense-VOS (DVOS) using a diffusion-based method through multi-task (reconstruction and segmentation) learning. We train the model first with synthetic data that mimics the camera and object motion of real videos and then with pseudo-labeled videos. We evaluate our DVOS method for wheat head segmentation from a diverse set of videos (handheld, drone-captured, different field locations, and different growth stages -- spanning from Boot-stage to Wheat-mature and Harvest-ready). Despite using only a few manually annotated video frames, the proposed approach yielded a high-performing model, achieving a Dice score of 0.79 when tested on a drone-captured external test set. While our method was evaluated on wheat head segmentation, it can be extended to other crops and domains, such as crowd analysis or microscopic image analysis.
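The Dice score reported above (0.79) is the standard overlap measure between a predicted segmentation mask and the ground-truth mask. As a minimal sketch (not the paper's evaluation code), it can be computed over flattened binary masks like this:

```python
def dice_score(pred, target, eps=1e-7):
    """Dice coefficient between two binary masks given as flat 0/1 sequences.
    Dice = 2|A ∩ B| / (|A| + |B|); eps guards against empty masks."""
    intersection = sum(p * t for p, t in zip(pred, target))
    return (2.0 * intersection + eps) / (sum(pred) + sum(target) + eps)

pred = [1, 1, 0, 1, 0, 0]    # 3 predicted foreground pixels
target = [1, 0, 0, 1, 1, 0]  # 3 true foreground pixels, 2 overlap
print(round(dice_score(pred, target), 3))  # → 0.667
```

A score of 1.0 means perfect overlap; 0.79 on an external drone-captured test set indicates substantial but imperfect agreement with the manual annotations.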
Problem

Research questions and friction points this paper is trying to address.

Segmenting dense, occluded objects in agricultural videos
Reducing reliance on costly manual pixel-level annotations
Improving accuracy in unpredictable motion scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semi-self-supervised spatiotemporal approach for DVOS
Diffusion-based multi-task learning for segmentation
Training with synthetic data and pseudo-labeled videos
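The multi-task objective named above combines a segmentation loss with a reconstruction loss. The paper does not spell out its exact loss formulation here, so the following is a hypothetical sketch of one common way to combine the two heads: per-pixel binary cross-entropy for segmentation and mean squared error for reconstruction, balanced by an illustrative weight `alpha`.

```python
import math

def bce(pred, target, eps=1e-7):
    """Per-pixel binary cross-entropy, averaged (segmentation head)."""
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for p, t in zip(pred, target)) / len(pred)

def mse(pred, target):
    """Mean squared error (reconstruction head)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def multitask_loss(seg_pred, seg_target, rec_pred, rec_target, alpha=0.5):
    """Weighted sum of the two objectives; alpha=0.5 is an arbitrary
    illustrative value, not a setting taken from the paper."""
    return alpha * bce(seg_pred, seg_target) + (1 - alpha) * mse(rec_pred, rec_target)
```

Jointly minimizing both terms forces the shared backbone to learn features that are useful for reconstructing the frame as well as for separating wheat heads from background, which is one way unlabeled or pseudo-labeled video can regularize the segmentation head.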
Authors
Keyhan Najafian, Department of Computer Science, University of Saskatchewan, Saskatoon, Saskatchewan, Canada
Farhad Maleki, Assistant Professor, University of Calgary
Ian Stavness, Professor, Computer Science, University of Saskatchewan
Lingling Jin, Department of Computer Science, University of Saskatchewan, Saskatoon, Saskatchewan, Canada