SAMJAM: Zero-Shot Video Scene Graph Generation for Egocentric Kitchen Videos

📅 2025-04-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing video scene graph generation (VidSGG) methods for first-person kitchen videos suffer from heavy reliance on large-scale training data and exhibit temporal inconsistency in object identity tracking across frames—particularly when leveraging multimodal large language models (e.g., Gemini). To address this, we propose the first training-free, zero-shot VidSGG framework. Our method uniquely integrates SAM2-based temporal mask propagation with Gemini’s multimodal semantic parsing, and introduces a novel graph-mask matching algorithm to achieve fine-grained object binding and semantic alignment—preserving precise bounding-box localization while substantially improving cross-frame object identity consistency. Evaluated on EPIC-KITCHENS and EPIC-KITCHENS-100, our approach achieves an 8.33% absolute improvement in mean recall over Gemini, demonstrating for the first time the feasibility of generating high-quality, temporally consistent scene graphs in dynamic, complex environments under a zero-shot paradigm.

Technology Category

Application Category

📝 Abstract
Video Scene Graph Generation (VidSGG) is an important topic in understanding dynamic kitchen environments. Current models for VidSGG require extensive training to produce scene graphs. Recently, Vision Language Models (VLM) and Vision Foundation Models (VFM) have demonstrated impressive zero-shot capabilities in a variety of tasks. However, VLMs like Gemini struggle with the dynamics for VidSGG, failing to maintain stable object identities across frames. To overcome this limitation, we propose SAMJAM, a zero-shot pipeline that combines SAM2's temporal tracking with Gemini's semantic understanding. SAM2 also improves upon Gemini's object grounding by producing more accurate bounding boxes. In our method, we first prompt Gemini to generate a frame-level scene graph. Then, we employ a matching algorithm to map each object in the scene graph with a SAM2-generated or SAM2-propagated mask, producing a temporally-consistent scene graph in dynamic environments. Finally, we repeat this process again in each of the following frames. We empirically demonstrate that SAMJAM outperforms Gemini by 8.33% in mean recall on the EPIC-KITCHENS and EPIC-KITCHENS-100 datasets.
Problem

Research questions and friction points this paper is trying to address.

Zero-shot video scene graph generation in kitchens
Maintaining object identities across video frames
Improving object grounding and temporal consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines SAM2 tracking with Gemini semantics
Uses matching algorithm for consistent object identities
Improves object grounding with accurate bounding boxes
🔎 Similar Papers
No similar papers found.
J
Joshua Li
Vision and Image Processing Lab, University of Waterloo
F
Fernando Jose Pena Cantu
Vision and Image Processing Lab, University of Waterloo
Emily Yu
Emily Yu
Assistant Professor, Leiden University
Neural controlHardware verificationAutomated Reasoning
Alexander Wong
Alexander Wong
Canada Research Chair FIET FInstP FRSPH FRSM FRGS FGS FRSA FISDDE, University of Waterloo
Artificial IntelligenceMachine LearningImage ProcessingComputer VisionMedical Imaging
Y
Yuchen Cui
Robot Intelligence Lab, UCLA
Y
Yuhao Chen
Vision and Image Processing Lab, University of Waterloo