WOW-Seg: A Word-free Open World Segmentation Model

📅 2026-05-16

🤖 AI Summary

This work proposes a text-free, vision-prompt-driven framework for open-world image segmentation. It targets two weaknesses of prior work: the limited generalization of closed-set methods and the insufficient semantic understanding of existing foundation segmentation models. The key contributions are Mask2Token, a module that converts masks into visual tokens aligned with the feature space of large vision-language models (VLLMs); a Cascade Attention Mask that decouples information across instances; and RR-7K, a region recognition benchmark whose 7,662 classes make it the most category-rich of its kind to date. On LVIS, the method achieves a semantic similarity of 89.7 and a semantic IoU of 82.4, surpassing the previous state of the art with only one-eighth of its parameter count.

📝 Abstract

Open world image segmentation aims to achieve precise segmentation and semantic understanding of targets within images by addressing the infinitely open set of object categories encountered in the real world. However, traditional closed-set segmentation approaches struggle to adapt to complex open world scenarios, while foundation segmentation models such as SAM pair strong segmentation capabilities with comparatively weak semantic understanding. To bridge this gap, we propose WOW-Seg, a Word-free Open World Segmentation model for segmenting and recognizing objects from open-set categories. Specifically, WOW-Seg introduces a novel visual prompt module, Mask2Token, which transforms image masks into visual tokens and ensures their alignment with the VLLM feature space. Moreover, we introduce the Cascade Attention Mask to decouple information across different instances. This approach mitigates inter-instance interference, leading to a significant improvement in model performance. We further construct an open world region recognition test benchmark: the Region Recognition Dataset (RR-7K). With 7,662 classes, it is the most category-rich region recognition dataset to date. WOW-Seg attains strong results on the LVIS dataset, achieving a semantic similarity of 89.7 and a semantic IoU of 82.4. This performance surpasses the previous SOTA while using only one-eighth the parameter count. These results underscore the strong open world generalization capabilities of WOW-Seg. The code and related resources are available at https://github.com/AAwcAA/WOW-Seg-Meta.
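The abstract does not say how Mask2Token or the Cascade Attention Mask are built, so the following is only an illustrative sketch of the two general ideas, not the authors' implementation. It assumes that a mask can be turned into a token by masked average pooling over a feature map, and that inter-instance interference can be blocked with a block-structured attention mask in which each instance's tokens see the shared image tokens and themselves but not other instances. The function names `mask_to_token` and `instance_isolating_mask` are hypothetical.

```python
import numpy as np


def mask_to_token(features: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Masked average pooling (an assumed stand-in for Mask2Token).

    features: (H, W, C) feature map; mask: (H, W) boolean region mask.
    Returns a single (C,) token averaging the features inside the mask.
    """
    weights = mask.astype(features.dtype)
    area = weights.sum()
    if area == 0:
        # Empty mask: return a zero token instead of dividing by zero.
        return np.zeros(features.shape[-1], dtype=features.dtype)
    return (features * weights[..., None]).sum(axis=(0, 1)) / area


def instance_isolating_mask(num_image_tokens: int,
                            tokens_per_instance: list[int]) -> np.ndarray:
    """Boolean attention mask (True = may attend), assumed layout:
    [image tokens | instance 0 tokens | instance 1 tokens | ...].

    Every token may attend to the shared image tokens; instance tokens
    may additionally attend only to tokens of the same instance.
    """
    total = num_image_tokens + sum(tokens_per_instance)
    allowed = np.zeros((total, total), dtype=bool)
    allowed[:, :num_image_tokens] = True  # shared image context
    start = num_image_tokens
    for n in tokens_per_instance:
        allowed[start:start + n, start:start + n] = True  # own block only
        start += n
    return allowed


# Toy usage: a 2x2 feature map with 3 channels and a diagonal mask.
features = np.arange(12, dtype=float).reshape(2, 2, 3)
mask = np.array([[True, False], [False, True]])
token = mask_to_token(features, mask)

# Two image tokens, then one instance with 1 token and one with 2 tokens.
attn = instance_isolating_mask(2, [1, 2])
```

In this sketch the tokens of different instances never attend to each other, which is one simple way to realize the decoupling the abstract describes; the actual cascade structure used by WOW-Seg may differ.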

Problem

Research questions and friction points this paper is trying to address.

open world segmentation

semantic understanding

image segmentation

open-set recognition

visual prompt

Innovation

Methods, ideas, or system contributions that make the work stand out.

Word-free Segmentation

Open World Segmentation

Mask2Token