A Simple Baseline with Single-encoder for Referring Image Segmentation

📅 2024-08-28
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
In referring image segmentation (RIS), dual-encoder architectures suffer from insufficient pixel-token alignment due to the lack of dense cross-modal interaction during pretraining, while multimodal fusion modules incur substantial computational overhead. To address these issues, this paper proposes the first unified RIS framework based on the single-encoder BEiT-3 architecture. By sharing self-attention mechanisms throughout the entire network, it enables end-to-end, fine-grained joint modeling of vision and language. Furthermore, we introduce a Shared Feature Pyramid Network (Shared FPN) and a Shared Mask Decoder to strengthen cross-modal alignment and improve decoding efficiency. Our method achieves state-of-the-art performance on major RIS benchmarks, with ~40% faster inference speed and 35% fewer parameters compared to prior approaches.
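To make the shared-attention idea concrete, here is a minimal, self-contained sketch of a single-encoder RIS forward pass, where image patch tokens and text tokens are concatenated into one sequence so every self-attention layer mixes both modalities. This is not the paper's implementation: the class names, the tiny depth and width, and the pooled-sentence mask head are illustrative assumptions, whereas the paper builds on a pretrained BEiT-3 backbone with its Shared FPN and Shared Mask Decoder.

```python
import torch
import torch.nn as nn

class SharedSelfAttentionBlock(nn.Module):
    """One transformer block whose self-attention jointly attends over
    image patch tokens and text tokens (no separate cross-attention)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.norm1(tokens)
        tokens = tokens + self.attn(x, x, x, need_weights=False)[0]
        tokens = tokens + self.mlp(self.norm2(tokens))
        return tokens

class SingleEncoderRIS(nn.Module):
    """Toy single-encoder RIS (hypothetical): concatenate patch and word
    tokens, run them through shared self-attention blocks, then score a
    coarse mask by dotting patch features with a pooled sentence query."""
    def __init__(self, dim: int = 256, depth: int = 4, patch: int = 16):
        super().__init__()
        self.patch = patch
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.word_embed = nn.Embedding(30522, dim)  # BERT-sized vocabulary
        self.blocks = nn.ModuleList(
            SharedSelfAttentionBlock(dim) for _ in range(depth)
        )
        self.mask_proj = nn.Linear(dim, dim)

    def forward(self, image: torch.Tensor, token_ids: torch.Tensor):
        B, _, H, W = image.shape
        vis = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, N, C)
        txt = self.word_embed(token_ids)                          # (B, L, C)
        n_vis = vis.shape[1]
        # Joint sequence: every layer performs dense cross-modal interaction.
        tokens = torch.cat([vis, txt], dim=1)
        for blk in self.blocks:
            tokens = blk(tokens)
        vis, txt = tokens[:, :n_vis], tokens[:, n_vis:]
        query = self.mask_proj(txt.mean(dim=1))            # pooled sentence query
        logits = torch.einsum("bnc,bc->bn", vis, query)    # per-patch mask logits
        h, w = H // self.patch, W // self.patch
        return logits.view(B, 1, h, w)

model = SingleEncoderRIS()
mask_logits = model(torch.randn(2, 3, 224, 224),
                    torch.randint(0, 30522, (2, 12)))
print(mask_logits.shape)  # torch.Size([2, 1, 14, 14])
```

Because the two modalities already interact inside every encoder layer, no separate multi-modal fusion module is needed between two encoders, which is where the efficiency gains claimed above come from.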

📝 Abstract
Referring image segmentation (RIS) requires dense vision-language interactions between visual pixels and textual words to segment objects based on a given description. However, commonly adopted dual-encoders in RIS, e.g., Swin Transformer and BERT (uni-modal encoders) or CLIP (a multi-modal dual-encoder), lack dense multi-modal interactions during pre-training, leading to a gap with the pixel-level RIS task. To bridge this gap, existing RIS methods often rely on multi-modal fusion modules that connect the two encoders, but this approach incurs high computational costs. In this paper, we present a novel RIS method with a single encoder, i.e., BEiT-3, maximizing the potential of shared self-attention across all framework components. This enables seamless interaction between the two modalities from input to final prediction, producing granularly aligned multi-modal features. Furthermore, we propose lightweight yet effective decoder modules, a Shared FPN and a Shared Mask Decoder, which contribute to the high efficiency of our model. Our simple baseline with a single encoder achieves outstanding performance on RIS benchmark datasets while maintaining computational efficiency, compared to the most recent SoTA methods based on dual-encoders.
Problem

Research questions and friction points this paper is trying to address.

Bridging vision-language gap in referring image segmentation
Reducing computational costs of multi-modal fusion modules
Enhancing efficiency with single-encoder and lightweight decoders
Innovation

Methods, ideas, or system contributions that make the work stand out.

Single-encoder BEiT-3 for dense vision-language interactions
Shared self-attention across all framework components
Lightweight Shared FPN and Shared Mask Decoder (see the sketch after this list)
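This summary does not spell out the internals of the Shared FPN and Shared Mask Decoder, so the sketch below only illustrates one plausible reading of "shared": a single lateral projection and a single smoothing convolution reused across all pyramid levels instead of per-level weights. The class SharedFPN and its layer names are hypothetical, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedFPN(nn.Module):
    """Illustrative 'shared' FPN: one lateral projection and one output
    convolution are reused at every pyramid level, keeping the decoder
    lightweight. (An assumed reading; the paper's design may differ.)"""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.lateral = nn.Conv2d(dim, dim, kernel_size=1)            # shared across levels
        self.smooth = nn.Conv2d(dim, dim, kernel_size=3, padding=1)  # shared across levels

    def forward(self, feats):
        # feats: list of (B, C, H_i, W_i) maps, finest first, coarsest last
        out = self.lateral(feats[-1])
        for f in reversed(feats[:-1]):
            out = F.interpolate(out, size=f.shape[-2:],
                                mode="bilinear", align_corners=False)
            out = self.smooth(out + self.lateral(f))
        return out  # fused map at the finest resolution

fpn = SharedFPN()
pyramid = [torch.randn(2, 256, 56, 56),
           torch.randn(2, 256, 28, 28),
           torch.randn(2, 256, 14, 14)]
fused = fpn(pyramid)
print(fused.shape)  # torch.Size([2, 256, 56, 56])
```

Under this reading, the decoder's parameter count stays constant as pyramid depth grows, which is one plausible way a shared decoder remains lightweight while still fusing multi-scale features.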
👥 Authors
Seonghoon Yu
AI Graduate School, GIST, South Korea
Ilchae Jung
NAVER Cloud, South Korea
Byeongju Han
NAVER Cloud, South Korea
Taeoh Kim
NAVER Cloud, South Korea
Yunho Kim
Electrical Engineering and Computer Science, GIST, South Korea
Dongyoon Wee
Leader at Clova AI, NAVER Corp
video representation learning · object tracking · neural radiance field
Jeany Son
POSTECH
Computer vision · Deep Learning