ArchGPT: Understanding the World's Architectures with Large Multimodal Models

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing VR/MR/AR applications in architecture rely on hard-coded annotations and bespoke interactions, suffering from poor scalability and limited generalizability. Method: We propose the first scalable visual question answering (VQA) data construction paradigm tailored for architecture: leveraging 3D reconstruction and semantic segmentation for coarse-to-fine image filtering, augmented by LLM-driven textual validation and knowledge distillation to automatically generate high-quality, culture-aware image–question–answer triplets. Contribution/Results: Based on this pipeline, we introduce Arch-300K—a domain-specific VQA dataset comprising 315,000 samples—and fine-tune ShareGPT4V-7B to obtain ArchGPT, a multimodal architectural VQA model. Experiments demonstrate that ArchGPT significantly outperforms general-purpose baselines on architectural VQA benchmarks, achieving the first fine-grained, generalizable, and culturally grounded interactive understanding of global architectural styles, structural features, and cultural semantics.

📝 Abstract
Architecture embodies aesthetic, cultural, and historical values, standing as a tangible testament to human civilization. Researchers have long leveraged virtual reality (VR), mixed reality (MR), and augmented reality (AR) to enable immersive exploration and interpretation of architecture, enhancing accessibility, public understanding, and creative workflows around architecture in education, heritage preservation, and professional design practice. However, existing VR/MR/AR systems are often developed case-by-case, relying on hard-coded annotations and task-specific interactions that do not scale across diverse built environments. In this work, we present ArchGPT, a multimodal architectural visual question answering (VQA) model, together with a scalable data-construction pipeline for curating high-quality, architecture-specific VQA annotations. This pipeline yields Arch-300K, a domain-specialized dataset of approximately 315,000 image-question-answer triplets. Arch-300K is built via a multi-stage process: first, we curate architectural scenes from Wikimedia Commons and filter unconstrained tourist photo collections using a novel coarse-to-fine strategy that integrates 3D reconstruction and semantic segmentation to select occlusion-free, structurally consistent architectural images. To mitigate noise and inconsistency in raw textual metadata, we propose an LLM-guided text verification and knowledge-distillation pipeline to generate reliable, architecture-specific question-answer pairs. Using these curated images and refined metadata, we further synthesize formal analysis annotations, including detailed descriptions and aspect-guided conversations, to provide richer semantic variety while remaining faithful to the data. We perform supervised fine-tuning of an open-source multimodal backbone, ShareGPT4V-7B, on Arch-300K, yielding ArchGPT.
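The coarse-to-fine image-filtering step described above can be sketched as follows. This is a minimal illustration only: the function names, thresholds, and the `ArchVQASample` fields are assumptions for exposition, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class ArchVQASample:
    """Hypothetical shape of one Arch-300K triplet (field names are assumed)."""
    image_path: str
    question: str
    answer: str
    aspect: str  # e.g. "style", "structure", "culture"

def coarse_to_fine_filter(images, reconstruction_score, occlusion_fraction,
                          min_score=0.5, max_occlusion=0.2):
    """Keep images that are structurally consistent and largely occlusion-free.

    reconstruction_score: callable giving a 3D-registration confidence per image
    occlusion_fraction: callable giving the fraction of the facade hidden
                        (e.g. by people or vehicles), from semantic segmentation
    """
    kept = []
    for img in images:
        # Coarse pass: drop images the 3D reconstruction cannot register well.
        if reconstruction_score(img) < min_score:
            continue
        # Fine pass: drop images where segmentation finds heavy occlusion.
        if occlusion_fraction(img) <= max_occlusion:
            kept.append(img)
    return kept
```

In this sketch the coarse pass uses a single reconstruction confidence and the fine pass a single occlusion ratio; the actual pipeline presumably applies richer geometric and semantic criteria.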
Problem

Research questions and friction points this paper is trying to address.

Developing scalable multimodal VQA models for architectural understanding
Creating an automated pipeline for architecture-specific dataset construction
Overcoming limitations of case-by-case VR/MR/AR systems in architecture
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal VQA model for architectural understanding
Coarse-to-fine image filtering with 3D reconstruction
LLM-guided pipeline for reliable QA generation
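The LLM-guided QA-generation idea from the bullets above can be sketched as a two-step verify-then-distill loop: raw metadata becomes a QA pair only if a judge model confirms it is consistent. The prompt wording and the `ask_llm` callable are assumptions for illustration, not the paper's actual interface.

```python
from typing import Callable, Optional

def verify_and_distill(metadata: dict, ask_llm: Callable[[str], str]) -> Optional[dict]:
    """Return an architecture-specific QA pair, or None if the metadata fails verification.

    metadata: expects hypothetical keys "building" and "caption".
    ask_llm:  any text-in/text-out LLM call (interface assumed for this sketch).
    """
    # Step 1: verification — ask the LLM to judge the raw caption.
    verdict = ask_llm(
        f"Is the following caption factually consistent with an image of "
        f"{metadata['building']}? Answer yes or no.\n\n{metadata['caption']}"
    )
    if not verdict.strip().lower().startswith("yes"):
        return None  # drop noisy or inconsistent metadata

    # Step 2: distillation — generate an answer grounded only in verified text.
    answer = ask_llm(
        f"Using only this verified caption, answer concisely: what architectural "
        f"style is {metadata['building']}?\n\n{metadata['caption']}"
    )
    return {"question": f"What architectural style is {metadata['building']}?",
            "answer": answer}
```

Returning `None` for rejected metadata keeps the pipeline purely additive: unreliable captions never reach the dataset, which matches the paper's stated goal of mitigating noise in raw textual metadata.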