Designing streetscapes from street-view imagery using diffusion models

📅 2026-05-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing street-view imagery studies focus largely on measuring current streetscapes and rarely generate alternative, non-existing scenarios, which limits their usefulness for urban planning. This work applies diffusion models to controllable street-view synthesis, constructing a multimodal dataset that aligns street-view images with semantic segmentation maps, road masks, textual prompts, and quantitative metrics of visual elements. The authors combine textual and visual controls and show that visual controls consistently dominate when the two conflict. Experiments on the Chicago and Orlando datasets show that adding visual controls raises mIoU by 46.4% and 23.7%, respectively, lowers LPIPS by approximately 6%, and improves building view indices by over 100%. These results indicate better semantic consistency while global visual realism is maintained, and they establish a benchmark for streetscape generation in urban scenario exploration.

📝 Abstract

Street-view imagery (SVI) is widely used to quantify key indicators of the urban environment, such as greenery, sky, or road view indices. However, existing studies largely focus on measuring current streetscapes and rarely support the generation of alternative and non-existing urban scenarios, which is a core task in geospatial disciplines such as urban planning and design. To address this gap, we propose a generative multimodal AI framework that synthesizes alternative streetscapes conditioned on targeted visual metrics, enabling direct visual exploration of urban scenarios. We first construct a multimodal dataset that aligns SVIs with textual descriptions, segmentation maps, road masks, and quantitative metrics of visual elements in Chicago and Orlando. Using this dataset, we demonstrate that diffusion models can produce realistic and semantically consistent streetscape imagery while responding to both textual and imagery controls. Our quantitative evaluations show that incorporating visual controls can improve semantic consistency, reducing the LPIPS index by approximately 6% while maintaining global visual realism. In addition, overall semantic consistency, as measured by the mIoU index, increases by 23.7% in Orlando and 46.4% in Chicago, with class-wise gains exceeding 100% for building view indices. Streetscape generation can be controlled in a fine-grained manner by both visual and textual prompts, and when textual and visual controls conflict, imagery controls consistently dominate, indicating a clear control hierarchy and the importance of further developing visual controls for urban scene generation. Overall, this work establishes an important benchmark for streetscape generation using SVIs and diffusion models, and illustrates how generative AI can serve as a practical, scalable, and controllable approach for urban scenario exploration.
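The abstract relies on view indices and mIoU without defining them here. The sketch below illustrates the commonly used definitions: a view index as the share of pixels belonging to one class in a segmentation map, and mIoU as the per-class intersection-over-union averaged over classes. This is a minimal illustration under stated assumptions, not the authors' evaluation code. The class IDs, the toy 4x4 maps, and the function names are hypothetical, and reading the reported percentages as relative changes is an assumption that this summary does not confirm.

```python
import numpy as np


def view_index(seg, class_id):
    """Share of pixels labelled `class_id` in a segmentation map (0 to 1)."""
    seg = np.asarray(seg)
    return float(np.mean(seg == class_id))


def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union, skipping classes absent from both maps."""
    pred = np.asarray(pred)
    target = np.asarray(target)
    ious = []
    for c in range(num_classes):
        p = pred == c
        t = target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class appears in neither map, so it has no IoU
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")


def relative_change(baseline, new):
    """Relative change in percent (assumed reading of the reported gains)."""
    return (new - baseline) / baseline * 100.0


# Hypothetical class IDs: 0 = road, 1 = building, 2 = vegetation, 3 = sky.
target = [
    [3, 3, 3, 3],
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [0, 0, 0, 0],
]
# A generated image whose segmentation differs from the target in one pixel.
pred = [
    [3, 3, 3, 3],
    [1, 2, 2, 2],
    [1, 1, 2, 2],
    [0, 0, 0, 0],
]

building_target = view_index(target, 1)   # 4 of 16 pixels
building_pred = view_index(pred, 1)       # 3 of 16 pixels
miou = mean_iou(pred, target, num_classes=4)
```

In this toy case the building view index is 0.25 for the target map and 0.1875 for the generated one, and the mIoU is 0.8875. In practice the predicted map would come from running a segmentation model on the generated image, and the target map would be the segmentation used as the visual control.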

Problem

Research questions and friction points this paper is trying to address.

street-view imagery

urban planning

generative AI

streetscape generation

diffusion models

Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion models

street-view imagery

multimodal generation