From NL2SQL to NL2GeoSQL: GeoSQL-Eval for automated evaluation of LLMs on PostGIS queries

📅 2025-09-28
🤖 AI Summary
The lack of systematic evaluation frameworks for natural language-to-SQL (NL2SQL) translation in PostGIS-enabled spatial databases hinders progress in geospatial AI. Method: We propose GeoSQL-Eval, the first end-to-end automated evaluation framework, accompanied by GeoSQL-Bench, a standardized, domain-diverse benchmark comprising 14,178 natural language questions, 340 PostGIS-specific functions, and 82 real-world spatial databases. Grounded in Webb's Depth of Knowledge (DOK) model, we design a four-dimensional, five-level, twenty-category evaluation taxonomy covering knowledge acquisition, syntactic generation, semantic alignment, and execution robustness. Leveraging entropy-weighted scoring and statistical analysis, we systematically assess 24 state-of-the-art LLMs. Contributions: (1) We fill a critical gap in geospatial NL2SQL evaluation; (2) we release the first PostGIS-specific, publicly available benchmark and leaderboard; and (3) we establish an interpretable, extensible evaluation paradigm to advance model optimization and real-world deployment in geoinformatics and urban spatial analytics.

πŸ“ Abstract
In recent years, large language models (LLMs) have achieved remarkable progress in natural language understanding and structured query generation (NL2SQL). However, extending these advances to GeoSQL tasks in the PostGIS environment remains challenging due to the complexity of spatial functions, geometric data types, and execution semantics. Existing evaluations primarily focus on general relational databases or Google Earth Engine code generation, leaving a lack of systematic benchmarks tailored to spatial databases. To address this gap, this study introduces GeoSQL-Eval, the first end-to-end automated evaluation framework for PostGIS query generation. Built upon Webb's Depth of Knowledge (DOK) model, the framework encompasses four cognitive dimensions, five proficiency levels, and twenty task categories, providing a comprehensive assessment of model performance in terms of knowledge acquisition, syntactic generation, semantic alignment, execution accuracy, and robustness. In parallel, we developed GeoSQL-Bench, a benchmark dataset comprising 14178 questions that span three task types, 340 PostGIS functions, and 82 domain-specific databases. Leveraging this framework, we systematically evaluated 24 representative models across six categories, applying entropy-weighting and statistical analyses to reveal differences in performance, error distributions, and resource consumption patterns. Furthermore, we established a public GeoSQL-Eval leaderboard that enables global research teams to conduct ongoing testing and comparison. These contributions not only extend the boundaries of NL2SQL applications but also provide a standardized, interpretable, and scalable framework for evaluating LLM performance in spatial database contexts, offering valuable insights for model optimization and applications in geographic information science, urban studies, and spatial analysis.
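The abstract mentions entropy weighting but does not spell out the procedure. As a rough illustration only, the sketch below implements the textbook entropy weight method: each metric column is turned into a probability distribution over models, its normalized Shannon entropy is computed, and metrics that separate models more strongly (lower entropy) receive larger weights. The function names, the normalization choice, and the score matrix are assumptions for illustration and are not taken from the paper.

```python
import math


def entropy_weights(matrix):
    """Return one weight per column using the textbook entropy weight method.

    matrix: list of rows (one per model); each row holds non-negative
    metric values where larger means better. Hypothetical helper, not
    the paper's implementation.
    """
    n = len(matrix)
    if n < 2:
        raise ValueError("need at least two rows to compare")
    m = len(matrix[0])

    divergences = []
    for j in range(m):
        column = [row[j] for row in matrix]
        total = sum(column)
        if total == 0:
            # A column of zeros carries no information.
            divergences.append(0.0)
            continue
        # Normalized Shannon entropy of the column's share distribution.
        entropy = 0.0
        for value in column:
            p = value / total
            if p > 0:
                entropy -= p * math.log(p)
        entropy /= math.log(n)
        # Clamp to guard against tiny negative values from rounding.
        divergences.append(max(0.0, 1.0 - entropy))

    d_sum = sum(divergences)
    if d_sum == 0:
        # All columns are equally uninformative: fall back to equal weights.
        return [1.0 / m] * m
    return [d / d_sum for d in divergences]


def weighted_scores(matrix, weights):
    """Composite score per row: weighted sum of its metric values."""
    return [sum(w * v for w, v in zip(weights, row)) for row in matrix]


# Made-up scores for three models on three metrics (not from the paper).
scores = [
    [0.90, 0.50, 0.70],
    [0.90, 0.80, 0.20],
    [0.90, 0.20, 0.60],
]
weights = entropy_weights(scores)
composite = weighted_scores(scores, weights)
```

In this toy matrix the first metric is identical for all models, so it gets a weight of essentially zero, while the two metrics that actually differ between models share the remaining weight.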
Problem

Research questions and friction points this paper is trying to address.

Extending NL2SQL to handle complex PostGIS spatial queries and functions
Addressing the lack of systematic benchmarks for spatial database evaluation
Providing automated assessment of LLMs on geographic query generation accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed GeoSQL-Eval automated evaluation framework for PostGIS
Created GeoSQL-Bench dataset with 14178 spatial questions
Established public leaderboard for ongoing LLM spatial testing
Authors

Shuyang Hou
State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing, Wuhan University, Wuhan, China

Haoyue Jiao
Wuhan University
GeoAI, Large Language Model, Code Generation

Ziqi Liu
State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing, Wuhan University, Wuhan, China

Lutong Xie
State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing, Wuhan University, Wuhan, China

Guanyu Chen
School of Resource and Environmental Sciences, Wuhan University, Wuhan, China

Shaowen Wu
State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing, Wuhan University, Wuhan, China

Xuefeng Guan
Professor, Wuhan University
High-performance GeoComputation, Big-data Analytics, Spatial Data Mining

Huayi Wu
Wuhan University
GIS, remote sensing, cartography, Geomatics