🤖 AI Summary
This work addresses the high cost of manual leftover monitoring in the German baking industry, driven by product diversity and short shelf lives, together with the prohibitive expense of fully supervised object-detection annotation. To this end, we propose a weakly supervised learning framework that integrates open-vocabulary detectors (OWLv2 and Grounding DINO) for initial localization, adopts YOLOv11 as the detection model, and leverages Segment Anything 2 to propagate pseudo-labels across video frames. Using only image-level labels, our approach achieves a mean Average Precision (mAP) of 0.91. After pseudo-label refinement, it gains 19.3% in performance under non-ideal deployment conditions, surpassing a fully supervised baseline while substantially reducing annotation effort and improving viewpoint robustness.
📝 Abstract
Monitoring leftover products provides valuable insights that can be used to optimize future production. This is especially important for German bakeries because freshly baked goods have a very short shelf life. Automating this process can reduce labor costs, improve accuracy, and streamline operations. We propose to automate it with an object detection model that identifies baked goods in images. However, the large diversity of German baked goods makes fully supervised training prohibitively expensive and limits scalability. Although open-vocabulary detectors (e.g., OWLv2, Grounding DINO) offer flexibility, we demonstrate that they are insufficient for our task. While motivated by bakeries, our work addresses the broader challenge of deploying computer vision in industry, where tasks are specialized and annotated datasets are scarce. We compile dataset splits with varying supervision levels, covering 19 classes of baked goods. We propose two training workflows to train an object detection model with limited supervision. First, we combine OWLv2 and Grounding DINO localization with image-level supervision to train the model in a weakly supervised manner. Second, we improve viewpoint robustness by fine-tuning on video frames annotated using Segment Anything 2 as a pseudo-label propagation model. Using these workflows, we train YOLOv11 for our detection task due to its favorable speed-accuracy trade-off. Relying solely on image-level supervision, the model achieves a mean Average Precision (mAP) of 0.91. Fine-tuning with pseudo-labels raises model performance by 19.3% under non-ideal deployment conditions. Combining these workflows yields a model that surpasses our fully supervised baseline under non-ideal deployment conditions, despite relying only on image-level supervision.
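To make the first workflow concrete, the sketch below illustrates the core filtering step: detections from an open-vocabulary detector are kept as pseudo-boxes only if their class matches the image-level labels and their confidence clears a threshold. This is a minimal illustration under assumed names (`Detection`, `filter_pseudo_labels`, the 0.3 threshold are hypothetical), not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str    # class name predicted by the open-vocabulary detector
    score: float  # detection confidence
    box: tuple    # (x1, y1, x2, y2) in pixels

def filter_pseudo_labels(detections, image_labels, min_score=0.3):
    """Keep boxes whose class appears in the image-level labels and whose
    confidence clears a threshold; the survivors serve as pseudo-boxes
    for weakly supervised training (illustrative logic only)."""
    return [d for d in detections
            if d.label in image_labels and d.score >= min_score]

# Example: an image known (at image level) to contain only pretzels and rolls.
dets = [Detection("pretzel", 0.85, (10, 10, 50, 50)),
        Detection("croissant", 0.60, (60, 10, 90, 40)),  # not in image labels
        Detection("roll", 0.20, (5, 60, 40, 95))]        # below threshold
kept = filter_pseudo_labels(dets, {"pretzel", "roll"})
print([d.label for d in kept])  # ['pretzel']
```

In the full pipeline these pseudo-boxes would then supervise YOLOv11 training, with the second workflow adding Segment Anything 2 to propagate such labels across neighboring video frames.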