LoGoSeg: Integrating Local and Global Features for Open-Vocabulary Semantic Segmentation

📅 2026-02-05
🤖 AI Summary
Open-vocabulary semantic segmentation often suffers from hallucinations, missed detections, and misaligned visual-textual representations in complex scenes because it lacks object priors and spatial constraints. To address these challenges, this work proposes LoGoSeg, a single-stage end-to-end framework that uses an object existence prior to suppress hallucinations, a region-aware alignment module to improve localization accuracy, and a dual-stream fusion mechanism to integrate local structural details with global semantic context. Built on vision-language models such as CLIP, LoGoSeg operates without external mask proposals or auxiliary models and reports competitive performance and strong generalization across six benchmark datasets (A-847, PC-459, A-150, PC-59, PAS-20, and PAS-20b).

📝 Abstract
Open-vocabulary semantic segmentation (OVSS) extends traditional closed-set segmentation by enabling pixel-wise annotation for both seen and unseen categories using arbitrary textual descriptions. While existing methods leverage vision-language models (VLMs) like CLIP, their reliance on image-level pretraining often results in imprecise spatial alignment, leading to mismatched segmentations in ambiguous or cluttered scenes. Moreover, most existing approaches lack strong object priors and region-level constraints, which can lead to object hallucination or missed detections, further degrading performance. To address these challenges, we propose LoGoSeg, an efficient single-stage framework that integrates three key innovations: (i) an object existence prior that dynamically weights relevant categories through global image-text similarity, reducing hallucinations; (ii) a region-aware alignment module that establishes region-level visual-textual correspondences; and (iii) a dual-stream fusion mechanism that combines local structural information with global semantic context. Unlike prior works, LoGoSeg requires no external mask proposals, additional backbones, or extra datasets, which keeps it efficient. Extensive experiments on six benchmarks (A-847, PC-459, A-150, PC-59, PAS-20, and PAS-20b) demonstrate its competitive performance and strong generalization in open-vocabulary settings.
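To make the idea of an object existence prior concrete, the toy sketch below shows one way global image-text similarity could reweight per-pixel category scores. It is an illustration under stated assumptions, not the authors' implementation: the softmax with a temperature, the multiplicative reweighting, and all function names (existence_prior, reweight_pixel_scores) are hypothetical choices, and the embeddings are hand-made two-dimensional vectors rather than CLIP outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def existence_prior(image_embedding, text_embeddings, temperature=0.1):
    """Assumed form of the prior: softmax over global image-text similarities.

    Returns one weight per category; the weights sum to 1.
    """
    sims = [cosine(image_embedding, t) / temperature for t in text_embeddings]
    peak = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


def reweight_pixel_scores(pixel_scores, prior):
    """Scale each pixel's per-category score by that category's prior weight."""
    return [[s * w for s, w in zip(scores, prior)] for scores in pixel_scores]


def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)


# Toy data (hypothetical): category 0 matches the global image embedding,
# category 1 is orthogonal to it and so is unlikely to be present.
image_embedding = [1.0, 0.0]
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]
prior = existence_prior(image_embedding, text_embeddings)

# Two pixels; the first one slightly prefers the unlikely category 1.
pixel_scores = [[0.4, 0.6], [0.9, 0.1]]
labels = [argmax(row) for row in reweight_pixel_scores(pixel_scores, prior)]
```

In this toy case the first pixel, which on its own would be assigned to category 1, is assigned to category 0 after reweighting, because the global similarity indicates that category 1 is unlikely to be present in the image. This is the sense in which such a prior can suppress hallucinated categories.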
Problem

Research questions and friction points this paper is trying to address.

open-vocabulary semantic segmentation
spatial alignment
object hallucination
region-level constraints
vision-language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

open-vocabulary segmentation
vision-language model
region-aware alignment
object existence prior
dual-stream fusion
Junyang Chen
School of Computer Science and Engineering, Southeast University, Nanjing 211189, China; Key Laboratory of Computer Network and Information Integration (Ministry of Education), Southeast University, Nanjing 211189, China
Xiangbo Lv
School of Computer Science and Engineering, Southeast University, Nanjing 211189, China; Key Laboratory of Computer Network and Information Integration (Ministry of Education), Southeast University, Nanjing 211189, China; Lenovo Research, Shanghai 201203, China
Zhiqiang Kou
Ph.D. Student at Southeast University, Internship at RIKEN AIP
Machine learning
Xingdong Sheng
Lenovo Research, Shanghai 201203, China
Ning Xu
School of Computer Science and Engineering, Southeast University, China
Machine Learning
Yiguo Qiao
School of Computer Science and Engineering, Southeast University, Nanjing 211189, China; Key Laboratory of Computer Network and Information Integration (Ministry of Education), Southeast University, Nanjing 211189, China