Understanding Image2Video Domain Shift in Food Segmentation: An Instance-level Analysis on Apples

📅 2026-02-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study examines why food segmentation models trained on static images become unreliable on video, where the lack of temporal consistency produces unstable instance identities and inaccurate counts. Using apples as a representative category, it evaluates image-trained models on video sequences with an instance segmentation and tracking-by-matching framework, which enables instance-level temporal analysis and reveals a disconnect between per-frame accuracy and temporal stability. The analysis identifies illumination changes, specular reflections, and texture ambiguity as the main causes of mask flickering and identity fragmentation. As remedies that do not require full video supervision, the authors examine post-hoc temporal regularization and self-supervised temporal consistency objectives. They conclude that the failures stem from image-centric training objectives that ignore temporal coherence rather than from limited model capacity.
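To make the idea of post-hoc temporal regularization concrete, here is a minimal sketch that majority-votes each pixel over a short temporal window. This is an illustrative assumption, not the paper's method: the function name `smooth_masks`, the window size, and the voting rule are all hypothetical.

```python
import numpy as np


def smooth_masks(masks: np.ndarray, window: int = 3) -> np.ndarray:
    """Majority-vote each pixel over a centered temporal window.

    masks: boolean array of shape (T, H, W), one binary mask per frame.
    The window is truncated at the start and end of the sequence.
    (Hypothetical helper; the paper's regularizer may differ.)
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = window // 2
    num_frames = masks.shape[0]
    out = np.empty_like(masks, dtype=bool)
    for t in range(num_frames):
        lo, hi = max(0, t - half), min(num_frames, t + half + 1)
        votes = masks[lo:hi].sum(axis=0)   # per-pixel count of "on" frames
        out[t] = votes * 2 > (hi - lo)     # strict majority within the window
    return out


# A mask that flickers off for a single frame is restored by the vote.
flicker = np.ones((5, 2, 2), dtype=bool)
flicker[2] = False
smoothed = smooth_masks(flicker, window=3)
```

A filter like this can only repair gaps shorter than half the window, and it may blur fast motion, which is one reason a purely post-hoc fix cannot replace temporally aware training.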

📝 Abstract

Food segmentation models trained on static images have achieved strong performance on benchmark datasets; however, their reliability in video settings remains poorly understood. In real-world applications such as food monitoring and instance counting, segmentation outputs must be temporally consistent, yet image-trained models often break down when deployed on videos. In this work, we analyze this failure through an instance segmentation and tracking perspective, focusing on apples as a representative food category. Models are trained solely on image-level food segmentation data and evaluated on video sequences using an instance segmentation with tracking-by-matching framework, enabling object-level temporal analysis. Our results reveal that high frame-wise segmentation accuracy does not translate to stable instance identities over time. Temporal appearance variations, particularly illumination changes, specular reflections, and texture ambiguity, lead to mask flickering and identity fragmentation, resulting in significant errors in apple counting. These failures are largely overlooked by conventional image-based metrics, which substantially overestimate real-world video performance. Beyond diagnosing the problem, we examine practical remedies that do not require full video supervision, including post-hoc temporal regularization and self-supervised temporal consistency objectives. Our findings suggest that the root cause of failure lies in image-centric training objectives that ignore temporal coherence, rather than model capacity. This study highlights a critical evaluation gap in food segmentation research and motivates temporally-aware learning and evaluation protocols for video-based food analysis.
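The gap between frame-wise accuracy and stable identities can be illustrated with a small tracking-by-matching sketch. The abstract does not specify the matching rule, so the greedy mask-IoU association, the `iou_thresh` value, and the names `mask_iou` and `track_by_matching` below are assumptions made for illustration only.

```python
import numpy as np


def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter / union) if union else 0.0


def track_by_matching(frames, iou_thresh=0.5):
    """Link per-frame instance masks to the previous frame by greedy IoU.

    frames: list (one entry per frame) of lists of boolean masks.
    Returns (ids_per_frame, total_ids). A mask with no match above
    iou_thresh starts a new identity. (Hypothetical sketch.)
    """
    next_id = 0
    prev = []  # (track_id, mask) pairs from the previous frame
    ids_per_frame = []
    for masks in frames:
        # Score every previous/current pair, best IoU first.
        pairs = sorted(
            ((mask_iou(pm, m), pi, mi)
             for pi, (_, pm) in enumerate(prev)
             for mi, m in enumerate(masks)),
            reverse=True,
        )
        assigned, used_prev = {}, set()
        for iou, pi, mi in pairs:
            if iou < iou_thresh:
                break
            if pi in used_prev or mi in assigned:
                continue
            assigned[mi] = prev[pi][0]
            used_prev.add(pi)
        ids = []
        for mi in range(len(masks)):
            if mi not in assigned:
                assigned[mi] = next_id
                next_id += 1
            ids.append(assigned[mi])
        prev = list(zip(ids, masks))
        ids_per_frame.append(ids)
    return ids_per_frame, next_id


# One apple, perfectly segmented in every frame where it is detected,
# but missed in frame 2 (a single-frame flicker).
apple = np.zeros((4, 4), dtype=bool)
apple[1:3, 1:3] = True
ids, count = track_by_matching([[apple], [apple], [], [apple]])
```

In this toy sequence every detected mask is pixel-perfect, yet the single missed frame splits the apple into two identities, so the count is 2 instead of 1. Per-frame metrics would not register this error, which is the kind of evaluation gap the abstract describes.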

Problem

Research questions and friction points this paper is trying to address.

domain shift

temporal consistency

instance segmentation

food segmentation

video analysis

Innovation

Methods, ideas, or system contributions that make the work stand out.

temporal consistency

instance segmentation

domain shift