🤖 AI Summary
This work addresses unsupervised scene change detection by proposing a zero-shot video change detection method that requires neither training nor adaptation to novel scenes. The method leverages a tracking model to perform cross-frame, zero-shot comparison between a reference frame and a query frame, enabling pixel-level localization of common, newly appeared, or vanished objects. To mitigate inter-frame discrepancies in content and style, it introduces an adaptive content thresholding mechanism and a style bridging layer. Furthermore, temporal information across video frames is explicitly integrated to enhance robustness. Evaluated on multi-domain benchmarks, the approach outperforms supervised baselines in cross-domain settings while exhibiting strong generalization, without fine-tuning or domain adaptation. It establishes a scalable, open-world paradigm for change detection under unconstrained, real-world conditions.
📝 Abstract
We present a novel, training-free approach to scene change detection. Our method leverages tracking models, which inherently perform change detection between consecutive video frames by identifying common objects and detecting new or missing ones. Specifically, our method exploits this change-detection effect by feeding the tracker a reference image and a query image instead of consecutive frames. Furthermore, we focus on the content gap and the style gap between the two input images, and address these issues by proposing an adaptive content threshold and style bridging layers, respectively. Finally, we extend our approach to video, leveraging rich temporal information to further improve scene change detection. We compare our approach with baselines through extensive experiments. While existing training-based baselines tend to specialize in the domain they were trained on, our method delivers consistent performance across diverse domains, demonstrating its competitiveness.
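The core mechanism above, repurposing a tracker's cross-frame matching as change detection, can be sketched at the bounding-box level. This is a simplified, illustrative stand-in only: the function names and the greedy IoU matching are assumptions, the actual method operates on pixel-level tracker outputs, and the fixed `threshold` argument stands in for the paper's adaptive content threshold, which is set per image pair rather than globally.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def detect_changes(ref_objects, query_objects, threshold=0.5):
    """Greedily match reference boxes to query boxes; leftovers are changes.

    Matched pairs correspond to common objects, unmatched query boxes to
    newly appeared objects, unmatched reference boxes to vanished objects.
    """
    matched_q = set()
    common, vanished = [], []
    for r in ref_objects:
        best, best_iou = None, threshold
        for j, q in enumerate(query_objects):
            score = iou(r, q)
            if j not in matched_q and score > best_iou:
                best, best_iou = j, score
        if best is None:
            vanished.append(r)        # present in reference only
        else:
            matched_q.add(best)
            common.append((r, query_objects[best]))
    appeared = [q for j, q in enumerate(query_objects) if j not in matched_q]
    return common, appeared, vanished
```

For example, a box that overlaps well across the two images is reported as common, while boxes with no counterpart fall into the appeared or vanished sets, mirroring the three outcomes the method localizes.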