Learning to Detect Baked Goods with Limited Supervision

📅 2026-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high cost of manual monitoring in the German baking industry—driven by product diversity and short shelf lives—and the prohibitive expense of fully supervised object detection annotation. To this end, we propose a weakly supervised learning framework that integrates open-vocabulary detectors (OWLv2 and Grounding DINO) for initial localization, adopts YOLOv11 as the backbone model, and leverages Segment Anything 2 to propagate pseudo-labels across video frames. Using only image-level labels, our approach achieves a mean Average Precision (mAP) of 0.91. Fine-tuning on propagated pseudo-labels yields a 19.3% performance gain under non-ideal deployment conditions, surpassing a fully supervised baseline while substantially reducing annotation dependency and improving viewpoint robustness.

📝 Abstract
Monitoring leftover products provides valuable insights that can be used to optimize future production. This is especially important for German bakeries because freshly baked goods have a very short shelf life. Automating this process can reduce labor costs, improve accuracy, and streamline operations. We propose automating this process using an object detection model to identify baked goods from images. However, the large diversity of German baked goods makes fully supervised training prohibitively expensive and limits scalability. Although open-vocabulary detectors (e.g., OWLv2, Grounding DINO) offer flexibility, we demonstrate that they are insufficient for our task. While motivated by bakeries, our work addresses the broader challenge of deploying computer vision in industries where tasks are specialized and annotated datasets are scarce. We compile dataset splits with varying supervision levels, covering 19 classes of baked goods. We propose two training workflows to train an object detection model with limited supervision. First, we combine OWLv2 and Grounding DINO localization with image-level supervision to train the model in a weakly supervised manner. Second, we improve viewpoint robustness by fine-tuning on video frames annotated using Segment Anything 2 as a pseudo-label propagation model. Using these workflows, we train YOLOv11 for our detection task due to its favorable speed-accuracy tradeoff. Relying solely on image-level supervision, the model achieves a mean Average Precision (mAP) of 0.91. Fine-tuning with pseudo-labels raises model performance by 19.3% under non-ideal deployment conditions. Combining these workflows trains a model that surpasses our fully supervised baseline under non-ideal deployment conditions, despite relying only on image-level supervision.
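The first workflow's core idea — using image-level labels to filter candidate boxes from an open-vocabulary detector into training pseudo-labels — can be sketched as below. This is a minimal illustration, not the paper's implementation; the function name, threshold value, and toy inputs are hypothetical, and in practice the boxes would come from OWLv2 or Grounding DINO.

```python
import numpy as np

def filter_pseudo_labels(boxes, scores, class_names, image_level_labels,
                         score_threshold=0.3):
    """Keep only candidate detections whose predicted class appears in the
    image-level labels and whose confidence clears the threshold.

    boxes: (N, 4) array of [x1, y1, x2, y2] boxes from an open-vocabulary
           detector (e.g., OWLv2 or Grounding DINO).
    scores: (N,) confidence scores.
    class_names: list of N predicted class names.
    image_level_labels: set of class names known to be present in the image.
    """
    keep = [
        i for i, (s, c) in enumerate(zip(scores, class_names))
        if s >= score_threshold and c in image_level_labels
    ]
    return boxes[keep], scores[keep], [class_names[i] for i in keep]

# Toy example: three candidate boxes, but the image-level labels say only
# "pretzel" is present, so the "croissant" box and the low-confidence
# "pretzel" box are both discarded.
boxes = np.array([[10, 10, 50, 50], [60, 10, 90, 40], [5, 60, 40, 95]])
scores = np.array([0.85, 0.20, 0.55])
names = ["pretzel", "pretzel", "croissant"]
kept_boxes, kept_scores, kept_names = filter_pseudo_labels(
    boxes, scores, names, image_level_labels={"pretzel"})
```

The surviving boxes then serve as localization targets, so the detector is trained without any manually drawn bounding boxes.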
Problem

Research questions and friction points this paper is trying to address.

object detection
limited supervision
baked goods
weakly supervised learning
computer vision
Innovation

Methods, ideas, or system contributions that make the work stand out.

weakly supervised object detection
open-vocabulary detection
pseudo-label propagation
limited supervision
YOLOv11
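For the pseudo-label propagation workflow, masks tracked across video frames (e.g., by Segment Anything 2) must be converted into box labels for the detector. A minimal sketch of that conversion step, assuming a binary mask per object and YOLO's normalized (cx, cy, w, h) label format; the function name is hypothetical:

```python
import numpy as np

def mask_to_yolo_box(mask):
    """Convert a binary segmentation mask of shape (H, W) — such as one
    propagated across video frames by a video segmentation model — into a
    normalized YOLO-format box (cx, cy, w, h). Returns None for an empty mask."""
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None
    h, w = mask.shape
    x1, x2 = xs.min(), xs.max() + 1  # half-open pixel extents
    y1, y2 = ys.min(), ys.max() + 1
    return ((x1 + x2) / 2 / w, (y1 + y2) / 2 / h,
            (x2 - x1) / w, (y2 - y1) / h)

# Toy example: a 10x10 mask with a 4-wide, 2-tall foreground region.
mask = np.zeros((10, 10), dtype=bool)
mask[3:5, 2:6] = True  # rows 3-4, cols 2-5
cx, cy, bw, bh = mask_to_yolo_box(mask)
```

Running each propagated mask through such a conversion yields frame-level box annotations from a single annotated frame, which is what makes the viewpoint-robustness fine-tuning cheap.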