Scene Change Detection with Vision-Language Representation Learning

📅 2026-04-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses scene change detection in urban environments, where variations in illumination, season, and viewpoint, together with complex spatial layouts, make semantic-level changes hard to identify. Existing approaches rely mainly on low-level visual features and therefore struggle to pick out changed objects in visually complex scenes. The authors propose LangSCD, a vision-language framework that adds language-based semantic reasoning to scene change detection. LangSCD uses a vision-language model (VLM) to generate textual descriptions of scene changes, fuses those descriptions with visual features through a cross-modal feature enhancer, and refines the predicted masks with a geometric-semantic matching module that enforces semantic consistency and spatial completeness. The study also introduces NYC-CD, a large-scale dataset of 8,122 real-world image pairs collected in New York City and annotated with multiclass changes through a semi-automatic pipeline. Experiments on multiple street-view benchmarks show that the language and matching modules consistently improve existing change-detection architectures and achieve state-of-the-art performance.

📝 Abstract

Scene change detection (SCD) is crucial for urban monitoring and navigation but remains challenging in real-world environments due to lighting variations, seasonal shifts, viewpoint differences, and complex urban layouts. Existing methods rely primarily on low-level visual features, limiting their ability to accurately identify changed objects amid the visual complexity of urban scenes. In this paper, we propose LangSCD, a vision-language framework for scene change detection that overcomes this single-modal limitation by incorporating semantic reasoning through language. Our approach introduces a modular language component that leverages vision-language models (VLMs) to generate textual descriptions of scene changes, which are fused with visual features through a cross-modal feature enhancer. We further introduce a geometric-semantic matching module that refines the predicted masks by enforcing semantic consistency and spatial completeness. Existing real-world scene change detection benchmarks provide only binary change annotations, which are insufficient for downstream applications requiring fine-grained understanding of scene dynamics. To address this limitation, we introduce NYC-CD, a large-scale dataset of 8,122 real-world image pairs collected in New York City with multiclass change annotations generated through a semi-automatic pipeline. Extensive experiments across multiple street-view benchmarks demonstrate that our language and matching modules consistently improve existing change-detection architectures, achieving state-of-the-art performance and highlighting the value of integrating linguistic reasoning with visual representations for robust scene change detection.
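The abstract says that textual descriptions are fused with visual features through a cross-modal feature enhancer, but it does not specify the mechanism. As a rough illustration only, the sketch below assumes the fusion is a single-head, residual cross-attention in which each visual token attends over the text tokens. The function name enhance_visual_features, the absence of learned projections, and the toy feature values are all hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def enhance_visual_features(visual_tokens, text_tokens):
    """Hypothetical enhancer: residual cross-attention from vision to text.

    Each visual token is the query; the text tokens serve as both keys and
    values. A real module would add learned projections and normalization,
    which are omitted here.
    """
    dim = len(visual_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    enhanced = []
    for v in visual_tokens:
        # Attention weights of this visual token over all text tokens.
        weights = softmax([dot(v, t) * scale for t in text_tokens])
        # Weighted sum of text tokens: the language context for this token.
        context = [
            sum(w * t[i] for w, t in zip(weights, text_tokens))
            for i in range(dim)
        ]
        # Residual connection keeps the original visual signal.
        enhanced.append([vi + ci for vi, ci in zip(v, context)])
    return enhanced


# Toy inputs: two visual tokens and two text tokens in a 2-D feature space.
visual = [[1.0, 0.0], [0.0, 1.0]]
text = [[4.0, 0.0], [0.0, 4.0]]
fused = enhance_visual_features(visual, text)
```

In this toy setup each visual token is pulled toward the text token it is most similar to, which is the intuition behind letting a language description sharpen the visual evidence for a change. The output keeps the shape of the visual input, so an enhancer of this kind could in principle sit in front of an existing change-detection head.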

Problem

Research questions and friction points this paper is trying to address.

Scene Change Detection

Urban Monitoring

Vision-Language Representation

Multiclass Change Annotation

Real-World Environments

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Representation

Scene Change Detection

Cross-Modal Feature Fusion