UrbanPlanBench: A Comprehensive Urban Planning Benchmark for Evaluating Large Language Models

📅 2025-04-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of specialized evaluation benchmarks for large language models (LLMs) in urban planning. To this end, the authors introduce UrbanPlanBench, the first comprehensive, domain-specific benchmark covering three core dimensions: fundamental planning principles, professional knowledge, and management and regulations. They also release UrbanPlanText, the largest supervised fine-tuning (SFT) dataset for urban planning to date, comprising over 30,000 instruction-response pairs sourced from urban planning exams and textbooks. Experimental results reveal that 70% of mainstream LLMs perform markedly worse on planning regulations than on other aspects; fine-tuning substantially improves factual recall and basic comprehension of planning knowledge, but yields limited gains on tasks requiring domain-specific terminology and reasoning. This study establishes the first professional-grade evaluation infrastructure for urban planning, providing both a methodological foundation and an empirical baseline for trustworthy deployment of LLMs in this vertical domain.

📝 Abstract
The advent of Large Language Models (LLMs) holds promise for revolutionizing various fields traditionally dominated by human expertise. Urban planning, a professional discipline that fundamentally shapes our daily surroundings, is one such field heavily relying on multifaceted domain knowledge and experience of human experts. The extent to which LLMs can assist human practitioners in urban planning remains largely unexplored. In this paper, we introduce a comprehensive benchmark, UrbanPlanBench, tailored to evaluate the efficacy of LLMs in urban planning, which encompasses fundamental principles, professional knowledge, and management and regulations, aligning closely with the qualifications expected of human planners. Through extensive evaluation, we reveal a significant imbalance in the acquisition of planning knowledge among LLMs, with even the most proficient models falling short of meeting professional standards. For instance, we observe that 70% of LLMs achieve subpar performance in understanding planning regulations compared to other aspects. Besides the benchmark, we present the largest-ever supervised fine-tuning (SFT) dataset, UrbanPlanText, comprising over 30,000 instruction pairs sourced from urban planning exams and textbooks. Our findings demonstrate that fine-tuned models exhibit enhanced performance in memorization tests and comprehension of urban planning knowledge, while there exists significant room for improvement, particularly in tasks requiring domain-specific terminology and reasoning. By making our benchmark, dataset, and associated evaluation and fine-tuning toolsets publicly available at https://github.com/tsinghua-fib-lab/PlanBench, we aim to catalyze the integration of LLMs into practical urban planning, fostering a symbiotic collaboration between human expertise and machine intelligence.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' efficacy in urban planning tasks
Assessing LLMs' knowledge imbalance in planning regulations
Enhancing LLM performance via domain-specific fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces UrbanPlanBench for LLM urban planning evaluation
Provides UrbanPlanText dataset with over 30,000 instruction pairs
Offers public toolsets for LLM fine-tuning and assessment
👥 Authors
Yu Zheng, Tsinghua University
Longyi Liu, University of Chinese Academy of Sciences
Yuming Lin, Tsinghua University
Jie Feng, Tsinghua University
Guozhen Zhang, Nanjing University
Depeng Jin, Tsinghua University
Yong Li, Tsinghua University