AMLRIS: Alignment-aware Masked Learning for Referring Image Segmentation

📅 2026-02-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited segmentation accuracy in Referring Image Segmentation (RIS) caused by insufficient alignment between the visual and linguistic modalities. To this end, the paper proposes Alignment-Aware Masked Learning (AML), which introduces, for the first time, a pixel-level vision-language alignment estimation mechanism. During training, AML dynamically filters out low-alignment regions and focuses on high-reliability semantic cues to refine the segmentation model. This significantly improves the model's robustness to diverse linguistic expressions and complex scenes, achieving state-of-the-art performance on the RefCOCO benchmark suite.

📝 Abstract
Referring Image Segmentation (RIS) aims to segment the object in an image identified by a natural language expression. The paper introduces Alignment-Aware Masked Learning (AML), a training strategy that enhances RIS by explicitly estimating pixel-level vision-language alignment, filtering out poorly aligned regions during optimization, and focusing on trustworthy cues. This approach achieves state-of-the-art performance on the RefCOCO datasets and improves robustness to diverse descriptions and scenarios.
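The abstract's core idea, estimating a per-pixel alignment score and masking poorly aligned pixels out of the training loss, can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the cosine-similarity scoring, the threshold `tau`, and the binary cross-entropy objective are assumptions chosen to make the mechanism concrete.

```python
import numpy as np

def alignment_masked_bce(pixel_feats, text_feat, logits, targets, tau=0.5):
    """Hypothetical alignment-masked segmentation loss.

    pixel_feats: (N, D) per-pixel visual features (flattened H*W)
    text_feat:   (D,)   sentence embedding of the referring expression
    logits:      (N,)   predicted segmentation logits
    targets:     (N,)   binary ground-truth mask
    tau:         alignment threshold in [-1, 1]
    """
    # Per-pixel vision-language alignment via cosine similarity.
    v = pixel_feats / (np.linalg.norm(pixel_feats, axis=1, keepdims=True) + 1e-8)
    t = text_feat / (np.linalg.norm(text_feat) + 1e-8)
    align = v @ t

    # Keep only well-aligned pixels for the loss ("masked learning").
    keep = align >= tau

    # Binary cross-entropy, restricted to the kept pixels.
    probs = 1.0 / (1.0 + np.exp(-logits))
    probs = np.clip(probs, 1e-7, 1 - 1e-7)
    bce = -(targets * np.log(probs) + (1 - targets) * np.log(1 - probs))
    if not keep.any():  # fall back to the full loss if everything is masked
        return bce.mean()
    return bce[keep].mean()
```

With a low threshold every pixel contributes (the ordinary loss); raising `tau` progressively excludes pixels whose features disagree with the text embedding, so noisy or ambiguous regions stop driving the gradient.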
Problem

Research questions and friction points this paper is trying to address.

Referring Image Segmentation
Vision-Language Alignment
Natural Language Expression
Object Segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Alignment-Aware Masked Learning
Referring Image Segmentation
Vision-Language Alignment
Pixel-level Alignment
Robustness to Language Variations