🤖 AI Summary
This work addresses zero-shot object re-identification in first-person kitchen videos, where abrupt viewpoint changes, frequent occlusions, cluttered scenes, and large intra-class appearance variations severely hinder performance. The authors propose a multi-stage re-identification pipeline built around SAM3 segmentation as its core component. The pipeline combines SAM3-based background suppression, fused SAM3, DINOv2, and CLIP features, a joint similarity metric based on cosine similarity and mask-shape IoU for geometric consistency, and k-reciprocal re-ranking, all without any labeled data. The method attains 52.8% mAP on the EPIC-Kitchens benchmark, 7.5 percentage points above the best zero-shot baseline (45.3% mAP).
📝 Abstract
Object re-identification (ReID) in egocentric kitchen videos is challenging due to rapid viewpoint changes, frequent occlusions, cluttered scenes, and large intra-class appearance variations. Objects may leave and re-enter the field of view, and the large diversity of instances with limited annotations makes supervised ReID difficult to scale, motivating zero-shot approaches. We study zero-shot object ReID on the EPIC-Kitchens benchmark, where the goal is to match active food and kitchen-tool instances across frames using only pre-trained visual features. We first evaluate five state-of-the-art feature extractors, including Vision-Language Models (VLMs): CLIP, DINOv2, DreamSim, I-JEPA, and SAM3. We show that these off-the-shelf zero-shot baselines perform poorly, with the best achieving only 45.3% mAP. We then propose an Enhanced SAM3 ReID Pipeline, a zero-shot multi-stage method built around SAM3 segmentation as the core component. Stage 1 uses SAM3 to suppress background clutter. Stage 2 fuses embeddings from SAM3, DINOv2, and CLIP into a single L2-normalized descriptor. Stage 3 augments cosine similarity with mask-shape IoU for geometric consistency, and Stage 4 applies k-reciprocal re-ranking. The full pipeline improves performance by 7.5 percentage points over the best baseline, from 45.3% to 52.8% mAP.
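Stages 2 to 4 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes the per-model embeddings and binary masks have already been computed (no SAM3, DINOv2, or CLIP calls are shown), that masks are cropped and resized to a common grid, and that fusion is plain concatenation followed by re-normalization. The function names, the mixing weights `alpha` and `lam`, and the neighborhood size `k` are hypothetical, and the re-ranking step is a simplified similarity-space variant of k-reciprocal re-ranking rather than the full published algorithm.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit L2 norm."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def fuse_descriptors(feats):
    """Stage 2 (assumed form): normalize each model's embeddings,
    concatenate them, and normalize the result again."""
    return l2_normalize(np.concatenate([l2_normalize(f) for f in feats], axis=-1))


def mask_shape_iou(mask_a, mask_b):
    """IoU of two boolean masks of identical shape."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(inter) / float(union) if union > 0 else 0.0


def joint_similarity(desc, masks, alpha=0.8):
    """Stage 3 (assumed form): weighted sum of cosine similarity and
    mask-shape IoU. alpha is a hypothetical mixing weight."""
    cos = desc @ desc.T  # rows are unit-norm, so the dot product is the cosine
    n = len(masks)
    iou = np.array([[mask_shape_iou(masks[i], masks[j]) for j in range(n)]
                    for i in range(n)])
    return alpha * cos + (1.0 - alpha) * iou


def k_reciprocal_rerank(sim, k=3, lam=0.5):
    """Stage 4 (simplified): blend the original similarity with the Jaccard
    overlap of k-reciprocal neighbor sets. Uses O(n^3) memory, so it is
    only suitable for small n."""
    n = sim.shape[0]
    topk = np.argsort(-sim, axis=1)[:, :k]           # k nearest items per row
    neigh = np.zeros((n, n), dtype=bool)
    neigh[np.arange(n)[:, None], topk] = True
    recip = neigh & neigh.T                          # keep mutual neighbors only
    inter = (recip[:, None, :] & recip[None, :, :]).sum(-1)
    union = (recip[:, None, :] | recip[None, :, :]).sum(-1)
    jaccard = inter / np.maximum(union, 1)
    return lam * sim + (1.0 - lam) * jaccard
```

To use the sketch, pass a list of `(n, d_i)` embedding arrays to `fuse_descriptors`, then feed the fused descriptors and the `n` masks to `joint_similarity`, and finally call `k_reciprocal_rerank` on the resulting matrix. Ranking each row of the output in descending order gives the candidate matches for that instance.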