CroBIM-V: Memory-Quality Controlled Remote Sensing Referring Video Object Segmentation

📅 2026-01-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of referring video object segmentation in remote sensing: weak target saliency, truncated visual information, the absence of large-scale benchmark datasets, and inaccurate localization with error propagation caused by biased initial memory and indiscriminate memory accumulation. To tackle these issues, the authors introduce RS-RVOS Bench, the first large-scale remote sensing video referring segmentation benchmark with causality-aware annotations, and propose MQC-SAM, a memory-quality-controlled online segmentation framework. MQC-SAM calibrates the initial memory using short-term motion consistency and incorporates a decoupled attention mechanism that dynamically evaluates and selects high-quality semantic features, suppressing noise accumulation. Experiments show that MQC-SAM achieves state-of-the-art performance on the proposed benchmark, improving both segmentation accuracy and stability.
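The initial-memory calibration via motion consistency could look something like the following sketch: fit a constant-velocity model to the target centroid over the first few frames and use the residual as a consistency score. Everything here (the function name, the centroid-based trajectory, the least-squares fit) is an illustrative assumption, not the paper's actual implementation.

```python
import numpy as np

def motion_consistency_score(centroids):
    """Hypothetical sketch of a short-term motion-consistency check for
    initial memory calibration: fit a constant-velocity model to the
    target's centroid trajectory over the first few frames and return
    the mean residual. A large residual suggests the initial mask is
    biased (e.g., anchored on a distractor) and should be recalibrated."""
    c = np.asarray(centroids, dtype=float)        # shape (T, 2): (x, y) per frame
    t = np.arange(len(c))
    # least-squares linear fit per coordinate (constant-velocity prior)
    A = np.stack([t, np.ones_like(t)], axis=1)    # design matrix [t, 1]
    coef, *_ = np.linalg.lstsq(A, c, rcond=None)
    residual = c - A @ coef                       # deviation from linear motion
    return float(np.mean(np.linalg.norm(residual, axis=1)))
```

A near-zero score means the early trajectory is consistent with smooth motion; a large score would flag the anchor for correction before it seeds the memory bank.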

📝 Abstract
Remote sensing video referring object segmentation (RS-RVOS) is challenged by weak target saliency and severe visual information truncation in dynamic scenes, making it extremely difficult to maintain discriminative target representations during segmentation. Moreover, progress in this field is hindered by the absence of large-scale dedicated benchmarks, while existing models are often affected by biased initial memory construction that impairs accurate instance localization in complex scenarios, as well as indiscriminate memory accumulation that encodes noise from occlusions or misclassifications, leading to persistent error propagation. This paper advances RS-RVOS research through dual contributions in data and methodology. First, we construct RS-RVOS Bench, the first large-scale benchmark comprising 111 video sequences, about 25,000 frames, and 213,000 temporal referring annotations. Unlike common RVOS benchmarks where many expressions are written with access to the full video context, our dataset adopts a strict causality-aware annotation strategy in which linguistic references are generated solely from the target state in the initial frame. Second, we propose a memory-quality-aware online referring segmentation framework, termed Memory Quality Control with Segment Anything Model (MQC-SAM). MQC-SAM introduces a temporal motion consistency module for initial memory calibration, leveraging short-term motion trajectory priors to correct structural deviations and establish accurate memory anchoring. Furthermore, it incorporates a decoupled attention-based memory integration mechanism with dynamic quality assessment, selectively updating high-confidence semantic features while filtering unreliable information, thereby effectively preventing error accumulation and propagation. Extensive experiments on RS-RVOS Bench demonstrate that MQC-SAM achieves state-of-the-art performance.
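The selective memory update described in the abstract, admitting only high-confidence features while filtering unreliable ones, can be sketched as a quality-gated memory bank. This is a minimal illustration under assumptions of our own: the class name, cosine similarity to the calibrated anchor as the quality score, and the fixed threshold are all hypothetical stand-ins for the paper's dynamic quality assessment.

```python
import numpy as np

class QualityGatedMemory:
    """Hypothetical sketch of quality-controlled memory accumulation:
    a candidate frame feature is admitted only if its quality score
    (here, cosine similarity to the calibrated initial anchor) exceeds
    a threshold, so occluded or misclassified frames do not pollute
    the memory and propagate errors to later frames."""

    def __init__(self, anchor, threshold=0.5, capacity=8):
        self.anchor = anchor / np.linalg.norm(anchor)
        self.threshold = threshold
        self.capacity = capacity
        self.bank = [self.anchor]                  # calibrated initial memory

    def quality(self, feat):
        f = feat / np.linalg.norm(feat)
        return float(f @ self.anchor)              # cosine similarity score

    def update(self, feat):
        """Admit the feature only if its quality clears the gate."""
        if self.quality(feat) >= self.threshold:
            self.bank.append(feat / np.linalg.norm(feat))
            if len(self.bank) > self.capacity:
                self.bank.pop(1)                   # keep anchor, drop oldest
            return True
        return False
```

The key design point mirrored here is that rejection, not down-weighting, stops low-quality frames from ever entering the memory, which is how indiscriminate accumulation is avoided.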
Problem

Research questions and friction points this paper is trying to address.

remote sensing
referring video object segmentation
memory bias
error propagation
target representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

memory-quality control
referring video object segmentation
remote sensing
causality-aware annotation
motion consistency
Haochen Jiang
School of Electronics and Information Engineering, Harbin Institute of Technology, Harbin 150001, China
Yuzhe Sun
School of Electronics and Information Engineering, Harbin Institute of Technology, Harbin 150001, China
Zhe Dong
Microsoft AI
Tianzhu Liu
School of Electronics and Information Engineering, Harbin Institute of Technology, Harbin 150001, China
Yanfeng Gu
Professor of Electronics Engineering, Harbin Institute of Technology
image processing · pattern recognition · machine learning