🤖 AI Summary
Existing satellite video tracking methods suffer from poor generalization, reliance on scene-specific training, and frequent target loss under occlusion. To address these limitations, we propose a training-free zero-shot tracking framework, the first to integrate the promptable vision foundation model SAM2 into satellite video tracking. The framework is augmented with Kalman filter-based motion constraints and a state-machine mechanism that suppress drift and enhance temporal consistency. To enable large-scale evaluation, we introduce MVOT, the first synthetic satellite video tracking dataset. Experiments demonstrate state-of-the-art performance: our method achieves a 5.84% AUC gain on the OOTB benchmark, outperforming both conventional trackers and existing foundation-model-based approaches. These results validate the effectiveness and robustness of the zero-shot paradigm in complex remote sensing scenarios.
📝 Abstract
Existing satellite video tracking methods often struggle to generalize, requiring scenario-specific training to achieve satisfactory performance, and are prone to track loss under occlusion. To address these challenges, we propose SatSAM2, a zero-shot satellite video tracker built on SAM2 that adapts vision foundation models to the remote sensing domain. SatSAM2 introduces two core modules: a Kalman Filter-based Constrained Motion Module (KFCMM), which exploits temporal motion cues to suppress drift, and a Motion-Constrained State Machine (MCSM), which regulates tracking states based on motion dynamics and tracking reliability. To support large-scale evaluation, we introduce MatrixCity Video Object Tracking (MVOT), a synthetic benchmark containing 1,500+ sequences and 157K annotated frames with diverse viewpoints, illumination, and occlusion conditions. Extensive experiments on two satellite tracking benchmarks and MVOT show that SatSAM2 outperforms both traditional and foundation model-based trackers, including SAM2 and its variants. Notably, on the OOTB dataset, SatSAM2 achieves a 5.84% AUC improvement over state-of-the-art methods. Our code and dataset will be publicly released to encourage further research.
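The abstract does not give implementation details, but the general idea behind Kalman-filter motion gating with a tracking state machine can be sketched as follows. This is a minimal illustrative sketch, not the paper's KFCMM/MCSM: a constant-velocity Kalman filter predicts the target position, measurements that land too far from the prediction are rejected, and a tiny two-state machine marks the track as lost after several consecutive rejections. All class names, gate thresholds, and noise settings here are hypothetical choices for illustration.

```python
import numpy as np

class ConstantVelocityKF:
    """Minimal constant-velocity Kalman filter over state (x, y, vx, vy)."""
    def __init__(self, x, y, dt=1.0):
        self.s = np.array([x, y, 0.0, 0.0])    # state: position and velocity
        self.P = np.eye(4) * 10.0              # state covariance
        self.F = np.eye(4)                     # constant-velocity transition
        self.F[0, 2] = self.F[1, 3] = dt
        self.H = np.eye(2, 4)                  # we observe position only
        self.Q = np.eye(4) * 0.01              # process noise (assumed)
        self.R = np.eye(2) * 1.0               # measurement noise (assumed)

    def predict(self):
        self.s = self.F @ self.s
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.s[:2]                      # predicted position

    def update(self, z):
        y = np.asarray(z, dtype=float) - self.H @ self.s   # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)           # Kalman gain
        self.s = self.s + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P


class GatedTracker:
    """Gate per-frame tracker outputs with the KF prediction and keep a
    tiny 'tracking' / 'lost' state machine."""
    def __init__(self, x, y, gate=20.0, max_misses=3):
        self.kf = ConstantVelocityKF(x, y)
        self.gate = gate            # max allowed prediction-to-measurement distance
        self.max_misses = max_misses
        self.misses = 0
        self.state = "tracking"

    def step(self, measurement):
        pred = self.kf.predict()
        if (measurement is not None
                and np.linalg.norm(np.asarray(measurement) - pred) <= self.gate):
            self.kf.update(measurement)        # plausible measurement: accept
            self.misses = 0
            self.state = "tracking"
        else:
            self.misses += 1                   # implausible or missing: coast
            if self.misses >= self.max_misses:
                self.state = "lost"
        return pred, self.state
```

In use, a tracker fed spatially consistent per-frame detections stays in the `"tracking"` state and refines its motion estimate, while a sudden large jump (e.g., a drifting segmentation under occlusion) is rejected and the filter coasts on its prediction until the track is declared lost or a plausible measurement returns.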