From Pixels to Predicates: Structuring Urban Perception with Scene Graphs

📅 2025-12-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing urban visual perception modeling primarily relies on pixel-level features or object co-occurrence statistics, neglecting explicit semantic relationships among scene elements—leading to poor interpretability and limited cross-city generalizability. To address this, we propose a three-stage structured modeling framework: (1) open-vocabulary panoptic scene graph parsing based on OpenPSG; (2) relation-aware embedding learning via a heterogeneous graph masked autoencoder (GraphMAE); and (3) neural regression for predicting six human-perception metrics. This work is the first to integrate open-set scene graphs with graph self-supervised learning for urban visual perception modeling. It achieves significant improvements in prediction accuracy (average +26%) and cross-city generalization. Moreover, it identifies critical negative relational patterns—e.g., “graffiti on wall” and “vehicles parked on sidewalk”—enabling highly interpretable, semantically structured urban perception representations.

📝 Abstract
Perception research is increasingly modelled using streetscapes, yet many approaches still rely on pixel features or object co-occurrence statistics, overlooking the explicit relations that shape human perception. This study proposes a three-stage pipeline that transforms street view imagery (SVI) into structured representations for predicting six perceptual indicators. In the first stage, each image is parsed using an open-set Panoptic Scene Graph model (OpenPSG) to extract object-predicate-object triplets. In the second stage, compact scene-level embeddings are learned through a heterogeneous graph masked autoencoder (GraphMAE). In the third stage, a neural network predicts perception scores from these embeddings. We evaluate the proposed approach against image-only baselines in terms of accuracy, precision, and cross-city generalization. Results indicate that (i) our approach improves perception prediction accuracy by an average of 26% over baseline models, and (ii) maintains strong generalization performance in cross-city prediction tasks. Additionally, the structured representation clarifies which relational patterns contribute to lower perception scores in urban scenes, such as "graffiti on wall" and "car parked on sidewalk". Overall, this study demonstrates that graph-based structure provides expressive, generalizable, and interpretable signals for modelling urban perception, advancing human-centric and context-aware urban analytics.
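The three-stage data flow described in the abstract can be sketched as follows. This is a deliberately minimal stand-in, not the paper's implementation: OpenPSG's output is reduced to hand-written triplets, GraphMAE is replaced by a bag-of-relations count vector, and the neural regressor is replaced by a linear score. All vocabulary, weights, and function names here are illustrative assumptions.

```python
# Hypothetical sketch of the pipeline: (1) scene-graph triplets per image,
# (2) a fixed-length scene embedding, (3) a regression head mapping
# embeddings to a perception score. All names/numbers are illustrative.
from collections import Counter

# Stage 1 output: (subject, predicate, object) triplets for one image,
# as an open-vocabulary scene-graph parser like OpenPSG might produce.
triplets = [
    ("graffiti", "on", "wall"),
    ("car", "parked on", "sidewalk"),
    ("tree", "beside", "road"),
]

# Stage 2 (toy stand-in for GraphMAE): bag-of-relations embedding over a
# small fixed predicate vocabulary.
vocab = ["on", "parked on", "beside", "above"]

def embed(triplets):
    counts = Counter(pred for _, pred, _ in triplets)
    return [counts.get(p, 0) for p in vocab]

# Stage 3 (toy stand-in for the neural regressor): one linear weight per
# predicate, with negative weights on relations the paper flags as
# lowering perceived quality.
weights = {"safety": [-0.8, -0.5, 0.2, 0.0]}

def predict(embedding, dim):
    return sum(w * x for w, x in zip(weights[dim], embedding))

emb = embed(triplets)
print(emb, round(predict(emb, "safety"), 2))  # [1, 1, 1, 0] -1.1
```

The point of the sketch is the interface between stages: each image becomes a set of relational triplets, each triplet set becomes one fixed-length vector, and the perception score is a function of that vector rather than of raw pixels.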
Problem

Research questions and friction points this paper is trying to address.

Transforms street view imagery into structured scene graphs
Predicts urban perception indicators using graph-based representations
Improves prediction accuracy and cross-city generalization over baselines
Innovation

Methods, ideas, or system contributions that make the work stand out.

OpenPSG model extracts object-predicate-object triplets
GraphMAE learns compact scene-level embeddings
Neural network predicts perception scores from embeddings
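The masked-autoencoder idea behind the GraphMAE step above can be illustrated with a toy example: hide one node's features and reconstruct them from its neighbours. Here the "model" is plain neighbour averaging rather than the GNN encoder-decoder the real method uses, and the graph and feature values are made up for the example.

```python
# Toy illustration of masked-feature reconstruction (the GraphMAE idea):
# mask a node's features, then predict them from its neighbourhood.
# Real GraphMAE trains a GNN encoder/decoder; this uses neighbour means.

# Scene graph as an adjacency list over node ids, with 2-d node features.
adj = {0: [1, 2], 1: [0], 2: [0]}
feats = {0: [0.5, 1.0], 1: [0.4, 0.8], 2: [0.6, 1.2]}

def reconstruct(node):
    """Predict a masked node's features as the mean of its neighbours'."""
    nbrs = adj[node]
    return [sum(feats[n][d] for n in nbrs) / len(nbrs)
            for d in range(len(feats[node]))]

recon = reconstruct(0)
error = sum((a - b) ** 2 for a, b in zip(recon, feats[0]))
print(recon, round(error, 4))
```

Training on this reconstruction loss forces the embedding to encode relational context: a node is only recoverable from the structure around it, which is what makes the resulting scene-level embeddings relation-aware.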
Yunlong Liu
School of Architecture, Southeast University, China
Shuyang Li
College of Design and Engineering, National University of Singapore, Singapore; Future Cities Laboratory, Singapore-ETH Centre, Singapore
Pengyuan Liu
University of Glasgow
GeoAI, Quantitative Urban Geographies
Yu Zhang
School of Architecture, Southeast University, China
Rudi Stouffs
National University of Singapore