MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation

📅 2024-11-28
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Problem: Conventional image augmentations degrade performance in referring image segmentation (RIS), and existing models are not sufficiently robust to occlusion, incomplete information, and linguistic complexity. Method: This paper proposes MaskRIS, a training framework that randomly masks both the image and the text and then applies Distortion-aware Contextual Learning (DCL) to fully exploit the masking strategy. Contribution/Results: MaskRIS can be applied to a variety of RIS models and outperforms existing methods in both fully supervised and weakly supervised settings. It achieves state-of-the-art results on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks. The code is publicly available.

📝 Abstract
Referring Image Segmentation (RIS) is an advanced vision-language task that involves identifying and segmenting objects within an image as described by free-form text descriptions. While previous studies focused on aligning visual and language features, training techniques such as data augmentation remain underexplored. In this work, we explore effective data augmentation for RIS and propose a novel training framework called Masked Referring Image Segmentation (MaskRIS). We observe that conventional image augmentations fall short in RIS, leading to performance degradation, while simple random masking significantly enhances the performance of RIS. MaskRIS uses both image and text masking, followed by Distortion-aware Contextual Learning (DCL) to fully exploit the benefits of the masking strategy. This approach can improve the model's robustness to occlusions, incomplete information, and various linguistic complexities, resulting in a significant performance improvement. Experiments demonstrate that MaskRIS can easily be applied to various RIS models, outperforming existing methods in both fully supervised and weakly supervised settings. Finally, MaskRIS achieves new state-of-the-art performance on RefCOCO, RefCOCO+, and RefCOCOg datasets. Code is available at https://github.com/naver-ai/maskris.
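The random masking of image and text described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the patch size, the masking ratios, the `[MASK]` placeholder, and the use of a single-channel image stored as nested lists are all assumptions made for illustration.

```python
import random

def mask_image_patches(image, patch, ratio, rng, fill=0):
    """Overwrite a random subset of non-overlapping patch x patch blocks with `fill`.

    `image` is a single-channel image given as a list of rows (an assumption
    made to keep the sketch free of third-party dependencies).
    """
    h, w = len(image), len(image[0])
    # Top-left corner of every block in the patch grid.
    corners = [(r, c) for r in range(0, h, patch) for c in range(0, w, patch)]
    n_masked = int(len(corners) * ratio)
    out = [row[:] for row in image]  # copy so the input is left untouched
    for r, c in rng.sample(corners, n_masked):
        for i in range(r, min(r + patch, h)):
            for j in range(c, min(c + patch, w)):
                out[i][j] = fill
    return out

def mask_text_tokens(tokens, ratio, rng, mask_token="[MASK]"):
    """Replace a random subset of tokens with a placeholder token."""
    n_masked = int(len(tokens) * ratio)
    masked_idx = set(rng.sample(range(len(tokens)), n_masked))
    return [mask_token if i in masked_idx else t for i, t in enumerate(tokens)]

# Hypothetical inputs: an 8x8 image of ones and a short referring expression.
rng = random.Random(0)
image = [[1] * 8 for _ in range(8)]
tokens = "the man in the red shirt".split()

masked_image = mask_image_patches(image, patch=2, ratio=0.5, rng=rng)
masked_text = mask_text_tokens(tokens, ratio=0.5, rng=rng)
```

With these hypothetical settings, half of the sixteen 2x2 patches and half of the six words are hidden. The model then has to locate the referred object from the remaining visual and linguistic context.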
Problem

Research questions and friction points this paper is trying to address.

Developing data augmentation for referring image segmentation
Addressing performance degradation from conventional image augmentations
Enhancing model robustness to occlusions and linguistic complexities
Innovation

Methods, ideas, or system contributions that make the work stand out.

MaskRIS uses image and text masking
It applies Distortion-aware Contextual Learning
The framework enhances robustness to occlusions
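This page does not describe how Distortion-aware Contextual Learning is formulated. As a purely illustrative sketch, assume a training objective that adds a consistency term between the prediction on the original input and the prediction on the masked input to an ordinary segmentation loss. The function names, the choice of mean squared difference as the consistency term, and the `weight` parameter are hypothetical and are not taken from the paper.

```python
import math

def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between per-pixel probabilities and 0/1 labels."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(pred)

def consistency(pred_masked, pred_clean):
    """Mean squared difference between predictions on masked and original inputs."""
    return sum((a - b) ** 2 for a, b in zip(pred_masked, pred_clean)) / len(pred_clean)

def combined_loss(pred_clean, pred_masked, target, weight=1.0):
    """Segmentation loss on the original input plus a weighted consistency term.

    This combination is an assumption for illustration, not the paper's DCL loss.
    """
    return bce(pred_clean, target) + weight * consistency(pred_masked, pred_clean)

# Hypothetical flattened mask probabilities for a four-pixel image.
target = [1, 0, 1, 0]
pred_clean = [0.9, 0.1, 0.8, 0.2]
pred_masked = [0.7, 0.3, 0.6, 0.4]
loss = combined_loss(pred_clean, pred_masked, target)
```

The consistency term is zero when both predictions agree, so it only penalizes the model when masking changes its output.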