FoodMem: Near Real-time and Precise Food Video Segmentation

📅 2024-07-16
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Food video segmentation suffers from low accuracy, severe flickering, and slow inference, limiting its utility in nutritional analysis, agricultural monitoring, and food processing quality inspection. To address these challenges, this paper proposes a two-stage architecture: (1) transformer-based initial semantic segmentation, followed by (2) memory-based temporal mask propagation for robust tracking. The authors also introduce a new annotated food dataset covering diverse foods, complex illumination, specular reflections, and multi-view scenarios absent from previous benchmarks. Experiments on MetaFood3D, Nutrition5k, and Vegetables&Fruits show a 2.5% mean average precision (mAP) improvement over the state of the art and inference that is 58× faster on average, while suppressing segmentation jitter, eliminating artifacts, and completing missing segments. The result is high-accuracy, near-real-time segmentation and robust tracking of food items in videos of 360-degree unbounded scenes.

📝 Abstract
Food segmentation, including in videos, is vital for addressing real-world health, agriculture, and food biotechnology issues. Current limitations lead to inaccurate nutritional analysis, inefficient crop management, and suboptimal food processing, impacting food security and public health. Improving segmentation techniques can enhance dietary assessments, agricultural productivity, and the food production process. This study introduces the development of a robust framework for high-quality, near-real-time segmentation and tracking of food items in videos, using minimal hardware resources. We present FoodMem, a novel framework designed to segment food items from video sequences of 360-degree unbounded scenes. FoodMem can consistently generate masks of food portions in a video sequence, overcoming the limitations of existing semantic segmentation models, such as flickering and prohibitive inference speeds in video processing contexts. To address these issues, FoodMem leverages a two-phase solution: a transformer segmentation phase to create initial segmentation masks and a memory-based tracking phase to monitor food masks in complex scenes. Our framework outperforms current state-of-the-art food segmentation models, yielding superior performance across various conditions, such as camera angles, lighting, reflections, scene complexity, and food diversity. This results in reduced segmentation noise, elimination of artifacts, and completion of missing segments. Here, we also introduce a new annotated food dataset encompassing challenging scenarios absent in previous benchmarks. Extensive experiments conducted on MetaFood3D, Nutrition5k, and Vegetables&Fruits datasets demonstrate that FoodMem enhances the state-of-the-art by 2.5% mean average precision in food video segmentation and is 58× faster on average.
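The two-phase design described in the abstract (a transformer stage that produces initial masks, then a memory-based stage that propagates them through the video) can be illustrated with a toy sketch. All names here (`initial_segmentation`, `SegmentationMemory`) and the per-pixel majority-vote rule are illustrative stand-ins under stated assumptions, not the paper's actual model or API; the point is only to show how keeping a short memory of past masks suppresses the frame-to-frame flicker the paper targets.

```python
from dataclasses import dataclass, field
from typing import List

# A binary mask represented as nested lists of 0/1.
Mask = List[List[int]]

def initial_segmentation(frame: List[List[int]]) -> Mask:
    """Stage 1 stand-in: threshold a frame to get a starting mask.
    (The real system uses a transformer segmentation model here.)"""
    return [[1 if px > 0 else 0 for px in row] for row in frame]

@dataclass
class SegmentationMemory:
    """Stage 2 stand-in: keep the last few masks and vote per pixel.
    (The real system uses learned memory-based mask propagation.)"""
    capacity: int = 3
    masks: List[Mask] = field(default_factory=list)

    def update(self, mask: Mask) -> Mask:
        # Store the newest mask, dropping the oldest beyond capacity.
        self.masks.append(mask)
        if len(self.masks) > self.capacity:
            self.masks.pop(0)
        h, w = len(mask), len(mask[0])
        # Strict majority vote over stored masks: a pixel that flips
        # in a single frame is outvoted by its recent history.
        return [
            [1 if sum(m[i][j] for m in self.masks) * 2 > len(self.masks) else 0
             for j in range(w)]
            for i in range(h)
        ]
```

Feeding a sequence where one pixel flickers off for a single frame, the voted mask returns to the stable segmentation once the memory outweighs the glitch; a learned propagation module plays the same temporal-smoothing role far more robustly.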
Problem

Research questions and friction points this paper is trying to address.

Low accuracy and flickering in food video segmentation
Slow inference prevents real-time food tracking
Downstream impact on food security and public health
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based segmentation for initial masks
Memory-based tracking for complex scenes
2.5% mAP gain with 58× faster inference