🤖 AI Summary
This work proposes a multimodal large language model (LLM)-based approach for automated, nationwide building condition assessment in the United States that substantially reduces reliance on human annotation. Using Google Street View imagery and a Gemma 3 27B model fine-tuned on a small human-labeled dataset, the method produces predictions closely aligned with human mean opinion scores (MOS), outperforming individual annotators on both Spearman's rank correlation coefficient (SRCC) and Pearson's linear correlation coefficient (PLCC). Knowledge distillation then transfers this capability to a lighter Gemma 3 4B model that achieves comparable performance with a 3x speedup, and further to EfficientNetV2-M (a CNN) and SwinV2-B (a transformer), which deliver similar performance with a 30x speed gain. The work also examines, through a human-AI alignment study, how well LLMs assess a broader list of built environment and housing attributes, and provides a visualization dashboard that integrates the LLM assessment outcomes for downstream analysis.
📝 Abstract
We present a novel framework for automatically evaluating building conditions nationwide in the United States by leveraging large language models (LLMs) and Google Street View (GSV) imagery. By fine-tuning Gemma 3 27B on a modest human-labeled dataset, our approach achieves strong alignment with human mean opinion scores (MOS), outperforming even individual raters on SRCC and PLCC relative to the MOS benchmark. To enhance efficiency, we apply knowledge distillation, transferring the capabilities of Gemma 3 27B to a smaller Gemma 3 4B model that achieves comparable performance with a 3x speedup. We further distill the knowledge into a CNN-based model (EfficientNetV2-M) and a transformer (SwinV2-B), which deliver similar performance with a 30x speed gain. In addition, we investigate LLMs' capabilities for assessing an extensive list of built environment and housing attributes through a human-AI alignment study, and we develop a visualization dashboard that integrates LLM assessment outcomes for downstream analysis by homeowners. Our framework offers a flexible and efficient solution for large-scale building condition assessment, enabling high accuracy with minimal human labeling effort.
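As a rough illustration of the evaluation metrics named above, the following minimal sketch computes SRCC and PLCC between a set of scores and the MOS. All ratings and model scores here are hypothetical placeholders, not data from the paper, and comparing one rater's column directly against the MOS is only an assumption about how individual raters might be scored; the paper's exact protocol is not described in this abstract.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical ratings: one row per building image, one column per rater.
ratings = [
    [1, 2, 2],
    [2, 2, 3],
    [3, 4, 3],
    [4, 5, 5],
    [5, 4, 5],
]
# Mean opinion score per image: the average over raters.
mos = [sum(row) / len(row) for row in ratings]

# Hypothetical model predictions for the same five images.
model_scores = [1.4, 2.1, 3.6, 4.5, 4.9]
# Scores from the first rater alone (assumed comparison protocol).
rater_0 = [row[0] for row in ratings]

print("model   SRCC:", spearman(model_scores, mos), "PLCC:", pearson(model_scores, mos))
print("rater 0 SRCC:", spearman(rater_0, mos), "PLCC:", pearson(rater_0, mos))
```

SRCC depends only on the ordering of the scores, while PLCC also reflects how linearly they track the MOS, which is presumably why the two are reported together.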