CroBIM-V: Memory-Quality Controlled Remote Sensing Referring Video Object Segmentation

📅 2026-01-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of referring video object segmentation in remote sensing: weak target saliency, truncated visual information, the absence of large-scale benchmark datasets, and inaccurate localization with error propagation caused by biased initial memory and indiscriminate memory accumulation. To tackle these issues, the authors introduce RS-RVOS Bench, the first large-scale remote sensing video referring segmentation benchmark with causality-aware annotations, and propose MQC-SAM, a memory-quality-controlled online segmentation framework. MQC-SAM calibrates the initial memory using short-term motion consistency and incorporates a decoupled attention mechanism that dynamically evaluates and selects high-quality semantic features, suppressing noise accumulation. Experiments show that MQC-SAM achieves state-of-the-art performance on the proposed benchmark, improving both segmentation accuracy and stability.
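The initial-memory calibration via motion consistency could look something like the following sketch: fit a constant-velocity model to the target centroid over the first few frames and use the residual as a consistency score. Everything here (the function name, the centroid-based trajectory, the least-squares fit) is an illustrative assumption, not the paper's actual implementation.

```python
import numpy as np

def motion_consistency_score(centroids):
    """Hypothetical sketch of a short-term motion-consistency check for
    initial memory calibration: fit a constant-velocity model to the
    target's centroid trajectory over the first few frames and return
    the mean residual. A large residual suggests the initial mask is
    biased (e.g., anchored on a distractor) and should be recalibrated."""
    c = np.asarray(centroids, dtype=float)        # shape (T, 2): (x, y) per frame
    t = np.arange(len(c))
    # least-squares linear fit per coordinate (constant-velocity prior)
    A = np.stack([t, np.ones_like(t)], axis=1)    # design matrix [t, 1]
    coef, *_ = np.linalg.lstsq(A, c, rcond=None)
    residual = c - A @ coef                       # deviation from linear motion
    return float(np.mean(np.linalg.norm(residual, axis=1)))
```

A near-zero score means the early trajectory is consistent with smooth motion; a large score would flag the anchor for correction before it seeds the memory bank.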

📝 Abstract
Remote sensing video referring object segmentation (RS-RVOS) is challenged by weak target saliency and severe visual information truncation in dynamic scenes, making it extremely difficult to maintain discriminative target representations during segmentation. Moreover, progress in this field is hindered by the absence of large-scale dedicated benchmarks, while existing models are often affected by biased initial memory construction that impairs accurate instance localization in complex scenarios, as well as indiscriminate memory accumulation that encodes noise from occlusions or misclassifications, leading to persistent error propagation. This paper advances RS-RVOS research through dual contributions in data and methodology. First, we construct RS-RVOS Bench, the first large-scale benchmark comprising 111 video sequences, about 25,000 frames, and 213,000 temporal referring annotations. Unlike common RVOS benchmarks where many expressions are written with access to the full video context, our dataset adopts a strict causality-aware annotation strategy in which linguistic references are generated solely from the target state in the initial frame. Second, we propose a memory-quality-aware online referring segmentation framework, termed Memory Quality Control with Segment Anything Model (MQC-SAM). MQC-SAM introduces a temporal motion consistency module for initial memory calibration, leveraging short-term motion trajectory priors to correct structural deviations and establish accurate memory anchoring. Furthermore, it incorporates a decoupled attention-based memory integration mechanism with dynamic quality assessment, selectively updating high-confidence semantic features while filtering unreliable information, thereby effectively preventing error accumulation and propagation. Extensive experiments on RS-RVOS Bench demonstrate that MQC-SAM achieves state-of-the-art performance.
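The selective memory update described in the abstract, admitting only high-confidence features while filtering unreliable ones, can be sketched as a quality-gated memory bank. This is a minimal illustration under assumptions of our own: the class name, cosine similarity to the calibrated anchor as the quality score, and the fixed threshold are all hypothetical stand-ins for the paper's dynamic quality assessment.

```python
import numpy as np

class QualityGatedMemory:
    """Hypothetical sketch of quality-controlled memory accumulation:
    a candidate frame feature is admitted only if its quality score
    (here, cosine similarity to the calibrated initial anchor) exceeds
    a threshold, so occluded or misclassified frames do not pollute
    the memory and propagate errors to later frames."""

    def __init__(self, anchor, threshold=0.5, capacity=8):
        self.anchor = anchor / np.linalg.norm(anchor)
        self.threshold = threshold
        self.capacity = capacity
        self.bank = [self.anchor]                  # calibrated initial memory

    def quality(self, feat):
        f = feat / np.linalg.norm(feat)
        return float(f @ self.anchor)              # cosine similarity score

    def update(self, feat):
        """Admit the feature only if its quality clears the gate."""
        if self.quality(feat) >= self.threshold:
            self.bank.append(feat / np.linalg.norm(feat))
            if len(self.bank) > self.capacity:
                self.bank.pop(1)                   # keep anchor, drop oldest
            return True
        return False
```

The key design point mirrored here is that rejection, not down-weighting, stops low-quality frames from ever entering the memory, which is how indiscriminate accumulation is avoided.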
Problem

Research questions and friction points this paper is trying to address.

remote sensing
referring video object segmentation
memory bias
error propagation
target representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

memory-quality control
referring video object segmentation
remote sensing
causality-aware annotation
motion consistency
Haochen Jiang
School of Electronics and Information Engineering, Harbin Institute of Technology, Harbin 150001, China
Yuzhe Sun
School of Electronics and Information Engineering, Harbin Institute of Technology, Harbin 150001, China
Zhe Dong
Microsoft AI
Tianzhu Liu
School of Electronics and Information Engineering, Harbin Institute of Technology, Harbin 150001, China
Yanfeng Gu
Professor of Electronics Engineering, Harbin Institute of Technology
image processing · pattern recognition · machine learning