Weakly Supervised Food Image Segmentation using Vision Transformers and Segment Anything Model

📅 2025-09-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high cost of pixel-level annotations in food image semantic segmentation, this paper proposes a weakly supervised approach that generates high-quality food region masks using only image-level labels. The method leverages a Swin Transformer to produce Class Activation Maps (CAMs), which are automatically converted into point and bounding-box prompts for the Segment Anything Model (SAM). Additionally, an image-adaptive preprocessing pipeline and a multi-mask fusion strategy are introduced to enhance segmentation robustness. Evaluated on the FoodSeg103 dataset, the method yields an average of 2.4 valid masks per image and achieves 0.54 mIoU with the multi-mask scheme, substantially outperforming existing weakly supervised methods. This work is the first to synergistically integrate ViT-based CAMs with SAM's prompting mechanism for weakly supervised food segmentation, enabling accurate, scalable segmentation without any pixel-level annotations and establishing a novel paradigm for real-world food analysis.

📝 Abstract
In this paper, we propose a weakly supervised semantic segmentation approach for food images which takes advantage of the zero-shot capabilities and promptability of the Segment Anything Model (SAM) along with the attention mechanisms of Vision Transformers (ViTs). Specifically, we use class activation maps (CAMs) from ViTs to generate prompts for SAM, resulting in masks suitable for food image segmentation. The ViT model, a Swin Transformer, is trained exclusively using image-level annotations, eliminating the need for pixel-level annotations during training. Additionally, to enhance the quality of the SAM-generated masks, we examine the use of image preprocessing techniques in combination with single-mask and multi-mask SAM generation strategies. The methodology is evaluated on the FoodSeg103 dataset, generating an average of 2.4 masks per image (excluding background), and achieving an mIoU of 0.54 for the multi-mask scenario. We envision the proposed approach as a tool to accelerate food image annotation tasks or as an integrated component in food and nutrition tracking applications.
Problem

Research questions and friction points this paper is trying to address.

Segmenting food images using weak supervision without pixel-level annotations
Leveraging Vision Transformers and SAM for zero-shot food segmentation
Improving food mask quality through preprocessing and multi-mask strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combining Vision Transformers with Segment Anything Model
Using class activation maps as prompts for SAM
Training with image-level instead of pixel-level annotations
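The core innovation above, turning CAMs into SAM prompts, can be sketched as follows. This is a minimal illustrative example, not the authors' code: the normalization, the fixed activation threshold, and the choice of the peak-activation pixel as the point prompt are assumptions for illustration.

```python
import numpy as np

def cam_to_sam_prompts(cam: np.ndarray, threshold: float = 0.5):
    """Convert a class activation map into SAM-style point and box prompts.

    Illustrative sketch: the paper converts CAMs into point and
    bounding-box prompts, but the exact thresholding rule here is
    an assumption, not the published implementation.
    """
    # Normalize the CAM to [0, 1].
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

    # Binary candidate food region from a fixed activation threshold.
    region = cam >= threshold
    ys, xs = np.nonzero(region)
    if ys.size == 0:
        return None  # no activation above threshold for this class

    # Point prompt: the most activated pixel, in (x, y) order as SAM expects.
    py, px = np.unravel_index(np.argmax(cam), cam.shape)
    point_coords = np.array([[px, py]])   # shape (1, 2)
    point_labels = np.array([1])          # 1 = foreground point

    # Box prompt: tight bounding box around the thresholded region,
    # in SAM's XYXY format.
    box = np.array([xs.min(), ys.min(), xs.max(), ys.max()])

    return point_coords, point_labels, box
```

The resulting arrays match the shapes expected by `SamPredictor.predict` in the official `segment-anything` library (`point_coords`, `point_labels`, `box`), where setting `multimask_output=True` yields the multiple candidate masks that the paper's multi-mask fusion strategy operates on.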