🤖 AI Summary
Low resolution, extremely small targets, and heavy noise in remote sensing imagery degrade both the accuracy and the temporal consistency of moving object detection (MOD). Existing methods based on probability density estimation struggle to model high-order spatiotemporal dependencies. To address this limitation, the paper proposes the first point-based progressive diffusion denoising framework tailored for MOD. It formulates detection as the iterative recovery of moving object centers from sparse noisy points; introduces a spatial relation aggregation attention mechanism and an implicit-memory-driven temporal propagation module for dynamic cross-frame feature fusion; and combines a progressive MinK optimal transport matching strategy with a missing loss that counteracts the clustering of denoised points around salient objects. Evaluated on the RsData benchmark, the method achieves significant improvements in small-object recall and inter-frame consistency, attaining state-of-the-art accuracy and robustness.
📝 Abstract
Moving object detection (MOD) in remote sensing is significantly challenged by low resolution, extremely small object sizes, and complex noise interference. Current deep learning-based MOD methods rely on probability density estimation, which restricts flexible information interaction between objects and across temporal frames. To flexibly capture high-order inter-object and temporal relationships, we propose a point-based MOD method for remote sensing. Inspired by diffusion models, we formulate network optimization as a progressive denoising process that iteratively recovers moving object centers from sparse noisy points. Specifically, we sample scattered features from the backbone outputs as atomic units for subsequent processing, and aggregate global feature embeddings to compensate for the limited coverage of sparse point features. By modeling spatial relative positions and semantic affinities, the Spatial Relation Aggregation Attention enables high-order interactions among point-level features for enhanced object representation. To improve temporal consistency, the Temporal Propagation and Global Fusion module leverages an implicit memory reasoning mechanism for robust cross-frame feature integration. To align with the progressive denoising process, we propose a progressive MinK optimal transport assignment strategy that establishes specialized learning objectives at each denoising level. Additionally, we introduce a missing loss function to counteract the tendency of denoised points to cluster around salient objects. Experiments on the RsData remote sensing MOD dataset show that our method, based on scattered point denoising, captures relationships among sparse moving objects more effectively and improves both detection capability and temporal consistency.
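To make the idea of recovering object centers from sparse noisy points more concrete, the toy sketch below moves random 2D points toward their nearest center over several denoising levels. It is an illustration under strong assumptions, not the paper's implementation: the oracle `denoise_step` stands in for a learned network, and `min_k_assign` is a plain k-nearest assignment with a shrinking `k` per level rather than the MinK optimal transport formulation. All function names, the center coordinates, and the `levels` schedule are hypothetical.

```python
import math
import random


def nearest_center(point, centers):
    """Return the ground-truth center closest to `point` (Euclidean distance)."""
    return min(centers, key=lambda c: math.dist(point, c))


def denoise_step(points, centers, step_size):
    """Move each point a fraction `step_size` of the way to its nearest center.

    Assumption: a trained model would predict this offset from features;
    the oracle nearest center is used here only to keep the loop runnable.
    """
    moved = []
    for p in points:
        c = nearest_center(p, centers)
        moved.append(tuple(pi + step_size * (ci - pi) for pi, ci in zip(p, c)))
    return moved


def min_k_assign(points, centers, k):
    """For each center, return the indices of its k closest points.

    Toy stand-in for a per-level assignment; not an optimal transport solver.
    """
    assignment = {}
    for j, c in enumerate(centers):
        order = sorted(range(len(points)), key=lambda i: math.dist(points[i], c))
        assignment[j] = order[:k]
    return assignment


def mean_error(points, centers):
    """Average distance from each point to its nearest center."""
    total = sum(math.dist(p, nearest_center(p, centers)) for p in points)
    return total / len(points)


def run_demo(seed=0, num_points=20, levels=(4, 2, 1)):
    """Run one denoising pass; `levels` gives the (hypothetical) k per level."""
    rng = random.Random(seed)
    centers = [(10.0, 10.0), (40.0, 25.0)]  # hypothetical object centers
    # Start from sparse random points scattered over a 50x50 frame.
    points = [(rng.uniform(0, 50), rng.uniform(0, 50)) for _ in range(num_points)]
    errors = [mean_error(points, centers)]
    assignments = []
    for k in levels:
        # Looser matching (larger k) early, stricter matching at later levels.
        assignments.append(min_k_assign(points, centers, k))
        points = denoise_step(points, centers, step_size=0.5)
        errors.append(mean_error(points, centers))
    return errors, assignments
```

Calling `run_demo()` returns the mean point-to-center error before and after each level, together with the per-level assignments. Because every point moves halfway to its nearest center, the error shrinks at each level, which mirrors the coarse-to-fine behaviour the abstract describes.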