What You Perceive Is What You Conceive: A Cognition-Inspired Framework for Open Vocabulary Image Segmentation

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing open-vocabulary image segmentation methods adopt a "segment-then-match" paradigm, which contradicts the human cognitive process of "semantic understanding before spatial localization" and leads to misalignment between concepts and regions. This work proposes a cognition-inspired framework: (1) a generative vision-language model (G-VLM) autonomously generates object-level semantic concepts; (2) a concept-aware visual enhancement module fuses textual concept features with visual representations across modalities; and (3) a cognition-inspired decoder classifies over a dynamically selected subset of relevant categories, enabling vocabulary-free, end-to-end segmentation. The method achieves 27.2 PQ, 17.0 mAP, and 35.3 mIoU on A-150, and further improves over prior art on six additional benchmarks including Cityscapes and Mapillary Vistas, with notable gains in semantic consistency and spatial localization accuracy.

📝 Abstract
Open vocabulary image segmentation tackles the challenge of recognizing dynamically adjustable, predefined novel categories at inference time by leveraging vision-language alignment. However, existing paradigms typically perform class-agnostic region segmentation followed by category matching, which deviates from the human visual system's process of recognizing objects based on semantic concepts, leading to poor alignment between region segmentation and target concepts. To bridge this gap, we propose a novel Cognition-Inspired Framework for open vocabulary image segmentation that emulates the human visual recognition process: first forming a conceptual understanding of an object, then perceiving its spatial extent. The framework consists of three core components: (1) A Generative Vision-Language Model (G-VLM) that mimics human cognition by generating object concepts to provide semantic guidance for region segmentation. (2) A Concept-Aware Visual Enhancer Module that fuses textual concept features with global visual representations, enabling adaptive visual perception based on target concepts. (3) A Cognition-Inspired Decoder that integrates local instance features with G-VLM-provided semantic cues, allowing selective classification over a subset of relevant categories. Extensive experiments demonstrate that our framework achieves significant improvements, reaching $27.2$ PQ, $17.0$ mAP, and $35.3$ mIoU on A-150. It further attains $56.2$, $28.2$, $15.4$, $59.2$, $18.7$, and $95.8$ mIoU on Cityscapes, Mapillary Vistas, A-847, PC-59, PC-459, and PAS-20, respectively. In addition, our framework supports vocabulary-free segmentation, offering enhanced flexibility in recognizing unseen categories. Code will be public.
Problem

Research questions and friction points this paper is trying to address.

Open vocabulary image segmentation lacks semantic alignment with human cognition
Existing methods perform class-agnostic segmentation before category matching
Current approaches poorly align region segmentation with target concepts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative Vision-Language Model mimics human cognition
Concept-Aware Visual Enhancer fuses text and visuals
Cognition-Inspired Decoder integrates local and semantic features
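The three components above form a "conceive-then-perceive" pipeline: propose concepts, condition the visual features on them, then classify only over the proposed subset. The following is a minimal NumPy sketch of that control flow; the similarity-based concept proposal, the single cross-attention step, and all embeddings are illustrative stand-ins, not the paper's actual G-VLM, enhancer, or decoder.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, C, K = 32, 49, 20, 3          # feature dim, image patches, vocab size, concepts kept

text_embeds = rng.standard_normal((C, D))   # stand-in for text-encoder category embeddings
patch_feats = rng.standard_normal((N, D))   # stand-in for visual-backbone patch features

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# (1) G-VLM stand-in: propose K object-level concepts for this image.
# A real G-VLM would generate concept words; here we rank categories by
# similarity to the pooled image feature.
def generate_concepts(patch_feats, text_embeds, k=K):
    global_feat = patch_feats.mean(axis=0)
    scores = text_embeds @ global_feat
    return np.argsort(scores)[-k:]          # indices of the proposed concepts

# (2) Concept-aware enhancement stand-in: cross-attend patches to the
# proposed concepts' text features, with a residual connection.
def enhance(patch_feats, concept_feats):
    attn = softmax(patch_feats @ concept_feats.T / np.sqrt(D))
    return patch_feats + attn @ concept_feats

# (3) Cognitive-decoder stand-in: classify each patch over the proposed
# subset only, then map subset indices back to the full vocabulary.
def decode(enhanced, concept_ids, text_embeds):
    logits = enhanced @ text_embeds[concept_ids].T   # (N, K)
    return concept_ids[logits.argmax(axis=1)]

concept_ids = generate_concepts(patch_feats, text_embeds)
enhanced = enhance(patch_feats, text_embeds[concept_ids])
labels = decode(enhanced, concept_ids, text_embeds)   # per-patch category ids
```

The key structural point the sketch preserves is that classification in step (3) never sees the full vocabulary — only the K concepts proposed in step (1) — which is what distinguishes this pipeline from the segment-then-match baseline.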
Jianghang Lin
Xiamen University
Multimodal Large Language Model, Vision-Language Model, Semi/Weakly-Supervised Learning
Yue Hu
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China.
Jiangtao Shen
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China.
Yunhang Shen
Tencent Youtu Lab, China.
Liujuan Cao
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China.
Shengchuan Zhang
Xiamen University
computer vision, machine learning
Rongrong Ji
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China.