🤖 AI Summary
Existing AI models struggle with fine-grained perception of the social, functional, and spatial attributes of urban environments, and often fail to distinguish key concepts such as activated versus non-activated public spaces or indoor versus outdoor settings. To address this gap, this work integrates urban theory into AI perception benchmarks and introduces Urban-ImageNet, the first large-scale, multi-scale, cross-modal dataset for urban space evaluation. It spans 24 Chinese cities and 61 urban sites and comprises over two million Weibo images with paired text posts. The dataset is organized by HUSIC, a hierarchical taxonomy with ten semantic categories, and supports three core tasks: semantic classification, image-text retrieval, and instance segmentation. Experimental results show that models perform well in supervised classification but face significant challenges in cross-modal retrieval and instance segmentation, and they reveal a marked performance gain as training data scales from 1K to 100K samples.
📝 Abstract
We present Urban-ImageNet, a large-scale multi-modal dataset and evaluation benchmark for urban space perception from user-generated social media imagery. The corpus contains over 2 million public social media images and paired textual posts collected from Weibo at 61 urban sites in 24 Chinese cities between 2019 and 2025, with controlled benchmark subsets at 1K, 10K, and 100K scale and a full 2M corpus for large-scale training and evaluation. Urban-ImageNet is organized by HUSIC, a Hierarchical Urban Space Image Classification framework that defines a 10-class taxonomy grounded in urban theory. The taxonomy is designed to distinguish activated and non-activated public spaces, exterior and interior urban environments, accommodation spaces, consumption content, portraits, and non-spatial social-media content. Rather than treating urban imagery as generic scene data, Urban-ImageNet evaluates whether machine perception models can capture the spatial, social, and functional distinctions that are central to urban studies. The benchmark supports three tasks within one standardized library: (T1) urban scene semantic classification, (T2) cross-modal image-text retrieval, and (T3) instance segmentation. Our experiments evaluate representative vision, vision-language, and segmentation models, revealing strong performance on supervised scene classification but weaker performance on cross-modal retrieval and instance-level urban object segmentation. A multi-scale study further examines how model performance changes as balanced training data increases from 1K to 10K to 100K images. Urban-ImageNet provides a unified, theory-grounded, multi-city benchmark for evaluating how AI systems perceive and interpret contemporary urban spaces across modalities, scales, and task formulations. The dataset and benchmark are available at: huggingface.co/datasets/Yiwei-Ou/Urban-ImageNet and github.com/yiasun/dataset-2.
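The balanced 1K, 10K, and 100K subsets mentioned above can be illustrated with a small sampling sketch. It assumes that "balanced" means an equal number of images per class (100, 1,000, and 10,000 per class for a 10-class taxonomy), which the abstract does not state explicitly. The function name `balanced_subset` and the integer label encoding 0 to 9 are hypothetical and are not taken from the released benchmark library.

```python
import random
from collections import defaultdict


def balanced_subset(labels, total, num_classes=10, seed=0):
    """Return sorted indices of a class-balanced subset of size `total`.

    Assumption: labels are integers in [0, num_classes), and "balanced"
    means total // num_classes samples drawn from every class.
    """
    if total % num_classes != 0:
        raise ValueError("total must be divisible by num_classes")
    per_class = total // num_classes

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    chosen = []
    for c in range(num_classes):
        pool = by_class.get(c, [])
        if len(pool) < per_class:
            raise ValueError(
                f"class {c} has only {len(pool)} samples, need {per_class}"
            )
        # Sample without replacement within the class.
        chosen.extend(rng.sample(pool, per_class))
    return sorted(chosen)


# Toy corpus: 10 hypothetical classes with uneven sizes (200 to 650 items).
toy_labels = [c for c in range(10) for _ in range(200 + 50 * c)]
subset = balanced_subset(toy_labels, total=1000)
```

Fixing the seed keeps each subset reproducible across runs, which matters when results at different scales are compared.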