BenchSeg: A Large-Scale Dataset and Benchmark for Multi-View Food Video Segmentation

📅 2026-01-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing food image segmentation methods struggle to generalize to novel viewpoints due to the lack of multi-view data, limiting the accuracy of dietary analysis. To address this, this work introduces BenchSeg, a dataset comprising 55 dishes with 25,284 densely annotated frames from 360° free-viewpoint videos, establishing the first large-scale benchmark for multi-view food video segmentation. The study systematically evaluates 20 state-of-the-art segmentation architectures—including SAM-based, transformer, CNN, and large multimodal models—both alone and combined with video memory modules such as XMem2. Experimental results show that SeTR-MLA paired with XMem2 achieves a 2.63% mAP improvement over FoodMem, demonstrating the efficacy of memory mechanisms in maintaining temporal consistency and segmentation quality under viewpoint variation, thereby advancing the integration of segmentation and tracking in dietary analysis.

📝 Abstract
Food image segmentation is a critical task for dietary analysis, enabling accurate estimation of food volume and nutrients. However, current methods suffer from limited multi-view data and poor generalization to new viewpoints. We introduce BenchSeg, a novel multi-view food video segmentation dataset and benchmark. BenchSeg aggregates 55 dish scenes (from Nutrition5k, Vegetables&Fruits, MetaFood3D, and FoodKit) with 25,284 meticulously annotated frames, capturing each dish under free 360° camera motion. We evaluate a diverse set of 20 state-of-the-art segmentation models (e.g., SAM-based, transformer, CNN, and large multimodal) on the existing FoodSeg103 dataset, then assess them (alone and combined with video-memory modules) on BenchSeg. Quantitative and qualitative results demonstrate that while standard image segmenters degrade sharply under novel viewpoints, memory-augmented methods maintain temporal consistency across frames. Our best configuration, SeTR-MLA+XMem2, outperforms prior work (e.g., improving over FoodMem by ~2.63% mAP), offering new insights into food segmentation and tracking for dietary analysis. In addition to frame-wise spatial accuracy, we introduce a dedicated temporal evaluation protocol that explicitly quantifies segmentation stability over time through continuity, flicker rate, and IoU drift metrics. This reveals failure modes that remain invisible under standard per-frame evaluations. We release BenchSeg to foster future research. The project page, including the dataset annotations and the food segmentation models, can be found at https://amughrabi.github.io/benchseg.
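The abstract's temporal evaluation protocol measures stability over time via continuity, flicker rate, and IoU drift. The paper's exact definitions are not given here, so the following is only a minimal sketch of how such metrics might be computed over a sequence of predicted masks; the function name `temporal_stability`, the flicker threshold, and the specific formulas are illustrative assumptions, not BenchSeg's protocol.

```python
import numpy as np

def frame_iou(a, b):
    """IoU between two boolean masks of the same shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union else 1.0

def temporal_stability(masks, flicker_thresh=0.2):
    """Hypothetical temporal-stability metrics for a mask sequence.

    masks: list of boolean arrays, one predicted mask per frame.
    Returns:
      continuity  - mean frame-to-frame IoU (higher = more stable),
      flicker     - fraction of transitions where IoU drops by more
                    than flicker_thresh versus the previous transition,
      drift       - IoU between the first and last masks (a proxy for
                    slow prediction drift across the whole clip).
    """
    ious = [frame_iou(masks[i], masks[i + 1]) for i in range(len(masks) - 1)]
    continuity = float(np.mean(ious))
    drops = [max(0.0, ious[i] - ious[i + 1]) for i in range(len(ious) - 1)]
    flicker = float(np.mean([d > flicker_thresh for d in drops])) if drops else 0.0
    drift = frame_iou(masks[0], masks[-1])
    return continuity, flicker, drift
```

A perfectly stable tracker (identical masks every frame) would score continuity 1.0, flicker 0.0, and drift 1.0 under these definitions; an image segmenter re-run per frame on a 360° video would typically score much lower on all three.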
Problem

Research questions and friction points this paper is trying to address.

food segmentation
multi-view
dietary analysis
viewpoint generalization
video segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-view food segmentation
video memory module
temporal consistency
large-scale dataset
dietary analysis