🤖 AI Summary
Traditional scene change detection (SCD) assumes strictly aligned reference and query views, yet real-world reference images often exhibit viewpoint discrepancies, which degrades performance. To address this, we introduce Environmental Change Detection (ECD), a new task targeting *unaligned* real-world scenarios, where changes are identified solely via environmental semantic cues rather than geometric correspondence. Our method is a multi-reference candidate fusion framework that eliminates rigid spatial matching. It integrates multi-reference retrieval and aggregation, semantic feature alignment, self-supervised environmental representation learning, and cross-view invariance modeling to achieve robust, environment-consistent change reasoning. Evaluated on three standard benchmarks reconstructed for ECD, our approach significantly outperforms a naive combination of state-of-the-art methods and achieves performance comparable to the oracle setting, demonstrating the effectiveness and robustness of semantics-driven ECD.
📝 Abstract
Humans do not memorize everything; instead, they recognize scene changes by consulting past images. However, the available past (i.e., reference) images typically show viewpoints near the present (i.e., query) scene rather than the identical view. Despite this practical limitation, conventional Scene Change Detection (SCD) has been formalized under an idealized setting in which a reference image with a matching viewpoint is available for every query. In this paper, we move this problem toward a practical task and introduce Environmental Change Detection (ECD). A key aspect of ECD is that it avoids unrealistically aligned query-reference pairs and relies solely on environmental cues. Inspired by real-world practice, we provide these cues through a large-scale database of uncurated images. To address this new task, we propose a novel framework that jointly understands spatial environments and detects changes. The main idea is that matching a query and a reference at the same spatial locations may lead to a suboptimal solution because of viewpoint misalignment and limited field-of-view (FOV) coverage. We address this limitation by leveraging multiple reference candidates and aggregating semantically rich representations for change detection. We evaluate our framework on three standard benchmark sets reconstructed for ECD, where it significantly outperforms a naive combination of state-of-the-art methods while achieving performance comparable to the oracle setting. The code will be released upon acceptance.
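The idea of replacing same-location matching with multiple reference candidates can be illustrated with a toy sketch. This is not the authors' framework: the function names (`retrieve_references`, `change_scores`), the database layout, and the hand-written feature vectors are all hypothetical, and plain cosine similarity stands in for whatever learned representations the paper actually uses. The sketch assumes each image has one global descriptor for retrieval and a list of patch features for comparison.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def retrieve_references(query_desc, database, k=2):
    """Return the k database entries whose global descriptor is closest to the query."""
    ranked = sorted(database, key=lambda e: cosine(query_desc, e["global"]), reverse=True)
    return ranked[:k]

def change_scores(query_patches, references):
    """Score each query patch as 1 - best similarity to ANY patch of ANY reference.

    Pooling patches across references means a query patch does not need a
    counterpart at the same spatial location in a single aligned image.
    Assumes at least one reference with at least one patch.
    """
    pool = [p for ref in references for p in ref["patches"]]
    return [1.0 - max(cosine(q, p) for p in pool) for q in query_patches]

# Hypothetical toy database: two nearby views of one place, one unrelated place.
database = [
    {"global": [1.0, 0.0], "patches": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]},
    {"global": [0.9, 0.1], "patches": [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]},  # shifted view
    {"global": [0.0, 1.0], "patches": [[0.0, 0.0, 1.0]]},                   # unrelated place
]

query_global = [1.0, 0.05]
query_patches = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # second patch is unseen content

refs = retrieve_references(query_global, database, k=2)
scores = change_scores(query_patches, refs)
print(scores)  # low score = explained by the references, high score = likely change
```

In this toy setup the first query patch is found in the retrieved references even though it sits at a different position there, so it scores as unchanged, while the second patch has no counterpart in any retrieved view and scores as changed.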