CitySeg: A 3D Open Vocabulary Semantic Segmentation Foundation Model in City-scale Scenarios

📅 2025-08-12
🤖 AI Summary
City-scale point cloud semantic segmentation generalizes poorly because 3D annotations are scarce, domain shift between datasets is severe, and labeling is inconsistent. To address this, the paper proposes CitySeg, an open-vocabulary framework capable of zero-shot semantic segmentation. It designs a local-global cross-attention network to jointly model geometric and semantic correlations, introduces a hierarchical graph encoder to explicitly capture the semantic hierarchy among categories, and integrates the text modality under a two-stage training strategy, with a hinge loss to make subcategories more separable. According to the authors, this is the first work to achieve zero-shot generalization on city-scale point clouds without relying on visual information. The method achieves state-of-the-art performance on nine closed-set benchmarks and improves cross-domain generalization.

📝 Abstract
Semantic segmentation of city-scale point clouds is a critical technology for Unmanned Aerial Vehicle (UAV) perception systems, enabling the classification of 3D points without relying on any visual information to achieve comprehensive 3D understanding. However, existing models are frequently constrained by the limited scale of 3D data and the domain gap between datasets, which lead to reduced generalization capability. To address these challenges, we propose CitySeg, a foundation model for city-scale point cloud semantic segmentation that incorporates text modality to achieve open vocabulary segmentation and zero-shot inference. Specifically, in order to mitigate the issue of non-uniform data distribution across multiple domains, we customize the data preprocessing rules, and propose a local-global cross-attention network to enhance the perception capabilities of point networks in UAV scenarios. To resolve semantic label discrepancies across datasets, we introduce a hierarchical classification strategy. A hierarchical graph established according to the data annotation rules consolidates the data labels, and the graph encoder is used to model the hierarchical relationships between categories. In addition, we propose a two-stage training strategy and employ hinge loss to increase the feature separability of subcategories. Experimental results demonstrate that the proposed CitySeg achieves state-of-the-art (SOTA) performance on nine closed-set benchmarks, significantly outperforming existing approaches. Moreover, for the first time, CitySeg enables zero-shot generalization in city-scale point cloud scenarios without relying on visual information.
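
The abstract does not give the loss formula, so the following is only a minimal sketch of how a hinge loss over point-to-text similarities might look. It assumes the standard multi-class margin form, cosine similarity as the score, and toy 2D embeddings; the function names, the `margin` value, and the vectors are illustrative assumptions, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def class_scores(point_feat, text_embeds):
    """Score one point feature against every class text embedding."""
    return [cosine(point_feat, t) for t in text_embeds]


def hinge_loss(point_feat, text_embeds, target, margin=0.2):
    """Multi-class hinge loss (assumed form): penalize every wrong class
    whose score comes within `margin` of the target class score."""
    scores = class_scores(point_feat, text_embeds)
    pos = scores[target]
    return sum(
        max(0.0, margin - pos + s)
        for j, s in enumerate(scores)
        if j != target
    )


def predict(point_feat, text_embeds):
    """Open-vocabulary prediction: pick the best-matching text embedding.
    Adding a new class only requires appending its text embedding."""
    scores = class_scores(point_feat, text_embeds)
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical text embeddings for two subcategories.
text_embeds = [[1.0, 0.0], [0.0, 1.0]]
well_separated = hinge_loss([1.0, 0.0], text_embeds, target=0)
ambiguous = hinge_loss([1.0, 1.0], text_embeds, target=0)
```

In this toy setup the well-separated point incurs no loss, while the ambiguous point, which scores equally on both classes, is penalized by the full margin. That is the sense in which a margin-based loss pushes subcategory features apart.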
Problem

Research questions and friction points this paper is trying to address.

Addresses limited generalization in city-scale 3D point cloud segmentation
Resolves semantic label discrepancies across diverse datasets
Enables zero-shot inference without visual information dependency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Local-global cross-attention network enhances perception
Hierarchical classification strategy resolves label discrepancies
Two-stage training with hinge loss improves separability
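
To illustrate how a label hierarchy can reconcile datasets annotated at different granularities, here is a small sketch. The category names and the `PARENT` table are hypothetical, and the code only shows the label graph and ancestor lookup; it does not reproduce the paper's graph encoder.

```python
# Hypothetical label hierarchy: each label maps to its parent (None = root).
PARENT = {
    "scene": None,
    "construction": "scene",
    "building": "construction",
    "roof": "building",
    "facade": "building",
    "vegetation": "scene",
    "tree": "vegetation",
    "grass": "vegetation",
}


def path_to_root(label):
    """Return the chain of labels from `label` up to the root."""
    path = []
    while label is not None:
        path.append(label)
        label = PARENT[label]
    return path


def common_ancestor(a, b):
    """Return the deepest label that both `a` and `b` descend from."""
    ancestors = set(path_to_root(a))
    for node in path_to_root(b):
        if node in ancestors:
            return node
    raise ValueError("labels do not share a root")


def consolidate(label, level_labels):
    """Map a fine label to the first ancestor found in `level_labels`,
    e.g. to align a fine-grained dataset with a coarse-grained one."""
    for node in path_to_root(label):
        if node in level_labels:
            return node
    raise ValueError(f"{label!r} has no ancestor in the target label set")


# A dataset that labels "roof" can be aligned with one that only has "building".
coarse = consolidate("roof", {"building", "vegetation"})
```

Under these assumptions, a point labeled `roof` in one dataset and `building` in another lands on the same node at the coarser level, which is the kind of consolidation a hierarchical classification strategy relies on.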
Jialei Xu
Huawei Technologies Co., Ltd.
Zizhuang Wei
Peking University
Computer Vision, 3D modeling
Weikang You
Huawei Technologies Co., Ltd.
Linyun Li
Huawei Technologies Co., Ltd.
Weijian Sun
Huawei Technologies Co., Ltd.