GeoProg3D: Compositional Visual Reasoning for City-Scale 3D Language Fields

📅 2025-06-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing 3D language field methods struggle to scale to city-level scenes and lack compositional geographic reasoning capabilities. To address this, we propose GeoProg3D, a visual programming framework that combines a Geography-aware City-scale 3D Language Field (GCLF) with Geographical Vision APIs (GV-APIs). GCLF is a hierarchical, memory-efficient 3D language field that uses directional, distance, elevation, and landmark information to filter large urban spaces. A large language model acts as the reasoning engine, dynamically generating programs that combine GV-APIs and operate GCLF. Evaluated on our newly constructed benchmark GeoEval3D (952 query-answer pairs covering grounding, spatial reasoning, comparison, counting, and measurement), GeoProg3D significantly outperforms existing 3D language fields and vision-language models across multiple tasks. To our knowledge, it is the first framework to enable compositional geographic reasoning in high-fidelity city-scale 3D environments via natural language.

📝 Abstract
The advancement of 3D language fields has enabled intuitive interactions with 3D scenes via natural language. However, existing approaches are typically limited to small-scale environments, lacking the scalability and compositional reasoning capabilities necessary for large, complex urban settings. To overcome these limitations, we propose GeoProg3D, a visual programming framework that enables natural language-driven interactions with city-scale high-fidelity 3D scenes. GeoProg3D consists of two key components: (i) a Geography-aware City-scale 3D Language Field (GCLF) that leverages a memory-efficient hierarchical 3D model to handle large-scale data, integrated with geographic information for efficiently filtering vast urban spaces using directional cues, distance measurements, elevation data, and landmark references; and (ii) Geographical Vision APIs (GV-APIs), specialized geographic vision tools such as area segmentation and object detection. Our framework employs large language models (LLMs) as reasoning engines to dynamically combine GV-APIs and operate GCLF, effectively supporting diverse geographic vision tasks. To assess performance in city-scale reasoning, we introduce GeoEval3D, a comprehensive benchmark dataset containing 952 query-answer pairs across five challenging tasks: grounding, spatial reasoning, comparison, counting, and measurement. Experiments demonstrate that GeoProg3D significantly outperforms existing 3D language fields and vision-language models across multiple tasks. To our knowledge, GeoProg3D is the first framework enabling compositional geographic reasoning in high-fidelity city-scale 3D environments via natural language. The code is available at https://snskysk.github.io/GeoProg3D/.
Problem

Research questions and friction points this paper is trying to address.

Enabling natural language interaction with city-scale 3D scenes
Overcoming the limited scalability and compositional reasoning of existing 3D language fields in urban environments
Providing geography-aware tools for processing large-scale 3D data
Innovation

Methods, ideas, or system contributions that make the work stand out.

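Memory-efficient hierarchical 3D language field (GCLF) for city-scale data
Geographical Vision APIs (GV-APIs), such as area segmentation and object detection, for urban tasks
LLM-driven dynamic composition of GV-APIs and operation of GCLF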
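To make the idea of LLM-driven composition more concrete, here is a minimal sketch of how a generated program might chain geographic filters to answer a counting query. It is not the GeoProg3D implementation: the `SceneObject` type, the operation names (`by_category`, `within_distance`, `in_direction`, `above_elevation`, `count`), the step-list program format, and the toy scene are all hypothetical, and the real system operates on a 3D language field rather than a list of labelled points.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    """Hypothetical stand-in for an entity retrieved from a 3D language field."""
    name: str
    category: str
    x: float  # metres east of the scene origin
    y: float  # metres north of the scene origin
    z: float  # height in metres


def by_category(objects, category):
    return [o for o in objects if o.category == category]


def within_distance(objects, anchor, radius_m):
    # Horizontal (ground-plane) distance to the anchor landmark.
    return [o for o in objects
            if math.hypot(o.x - anchor.x, o.y - anchor.y) <= radius_m]


def in_direction(objects, anchor, direction):
    tests = {
        "north": lambda o: o.y > anchor.y,
        "south": lambda o: o.y < anchor.y,
        "east": lambda o: o.x > anchor.x,
        "west": lambda o: o.x < anchor.x,
    }
    return [o for o in objects if tests[direction](o)]


def above_elevation(objects, min_height_m):
    return [o for o in objects if o.z >= min_height_m]


# Registry of hypothetical operations a generated program may call.
OPS = {
    "by_category": by_category,
    "within_distance": within_distance,
    "in_direction": in_direction,
    "above_elevation": above_elevation,
    "count": lambda objects: len(objects),
}


def run_program(program, objects, landmarks):
    """Execute a list of (operation, arguments) steps in order."""
    result = list(objects)
    for op, args in program:
        args = dict(args)
        if "landmark" in args:
            # Resolve a landmark name to its position in the scene.
            args["anchor"] = landmarks[args.pop("landmark")]
        result = OPS[op](result, **args)
    return result


# Toy scene (invented data, for illustration only).
landmarks = {"station": SceneObject("Station", "landmark", 0.0, 0.0, 0.0)}
scene = [
    SceneObject("Tower A", "building", 100.0, 200.0, 80.0),
    SceneObject("Tower B", "building", -50.0, 250.0, 40.0),
    SceneObject("Tower C", "building", 0.0, -100.0, 90.0),
    SceneObject("Tower D", "building", 400.0, 400.0, 120.0),
    SceneObject("Tower E", "building", -150.0, 100.0, 60.0),
    SceneObject("Central Park", "park", 10.0, 50.0, 0.0),
]

# A program an LLM might emit for the query:
# "How many buildings at least 50 m tall lie within 300 m north of the station?"
program = [
    ("by_category", {"category": "building"}),
    ("within_distance", {"landmark": "station", "radius_m": 300.0}),
    ("in_direction", {"landmark": "station", "direction": "north"}),
    ("above_elevation", {"min_height_m": 50.0}),
    ("count", {}),
]

answer = run_program(program, scene, landmarks)
print(answer)  # 2 (Tower A and Tower E)
```

Representing the program as a plain list of steps keeps each filter independent, so the same small set of operations can be recombined for comparison or measurement queries without changing the executor.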
Shunsuke Yasuki
Rikkyo University
Taiki Miyanishi
The University of Tokyo
Computer Vision, Internet of Things, Information Retrieval
Nakamasa Inoue
Institute of Science Tokyo
Shuhei Kurita
National Institute of Informatics
Deep Learning, Large Language Models, Computer Vision
Koya Sakamoto
University of Tokyo
Daichi Azuma
University of Tokyo, Sony Semiconductor Solutions
Masato Taki
Rikkyo University
Yutaka Matsuo
University of Tokyo