Are Pretrained Image Matchers Good Enough for SAR-Optical Satellite Registration?

📅 2026-04-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the critical challenge of cross-modal registration between optical and synthetic aperture radar (SAR) images for disaster response in remote sensing, systematically evaluating 24 pretrained matchers in a zero-shot setting. Using tiled large-image inference, robust geometric filtering, and tie-point-based metrics, the work finds that transfer is asymmetric: matchers with explicit cross-modal training do not uniformly outperform those without it, suggesting that foundation-model features such as DINOv2 may provide a degree of modality invariance that partially substitutes for cross-modal supervision. Notably, deployment protocol choices can influence accuracy as much as the choice of matcher itself. On the labeled SpaceNet9 training scenes, XoFTR and RoMa achieve the lowest mean error of 3.0 pixels, with RoMa doing so without any cross-modal training, and MatchAnything-ELoFTR follows closely at 3.4 pixels; protocol choices (geometry model, tile size, inlier gating) shift a single matcher's error by up to 33×.
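The tile-based large-image inference mentioned above can be sketched as follows. This is a minimal illustration, not the paper's code; the tile size and overlap defaults are assumptions, and `to_global` stands in for whatever coordinate bookkeeping the actual pipeline uses:

```python
# Sketch of tiled large-image inference (assumed parameters, not the paper's
# implementation): split a large scene into overlapping tiles, run a matcher
# per tile, and map per-tile match coordinates back to full-image coordinates.

def tile_offsets(size, tile=1024, overlap=128):
    """1-D tile start positions covering [0, size), with a final edge tile."""
    if size <= tile:
        return [0]
    step = tile - overlap
    offsets = list(range(0, size - tile + 1, step))
    if offsets[-1] + tile < size:  # make sure the image edge is covered
        offsets.append(size - tile)
    return offsets

def tile_windows(width, height, tile=1024, overlap=128):
    """All (x0, y0, x1, y1) tile windows for a width x height image."""
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in tile_offsets(height, tile, overlap)
        for x in tile_offsets(width, tile, overlap)
    ]

def to_global(matches, window):
    """Shift per-tile match coordinates (x, y) into full-image coordinates."""
    x0, y0 = window[0], window[1]
    return [(x + x0, y + y0) for x, y in matches]
```

Overlap between neighboring tiles ensures that features near tile borders can still be matched in at least one tile, at the cost of some duplicate matches that the downstream geometric filtering must deduplicate or tolerate.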

📝 Abstract
Cross-modal optical-SAR (Synthetic Aperture Radar) registration is a bottleneck for remote-sensing disaster response, yet modern image matchers are developed and benchmarked almost exclusively on natural-image domains. We evaluate twenty-four pretrained matcher families--in a zero-shot setting with no fine-tuning or domain adaptation on satellite or SAR data--on SpaceNet9 and two additional cross-modal benchmarks under a deterministic protocol with tiled large-image inference, robust geometric filtering, and tie-point-grounded metrics. Our results reveal asymmetric transfer--matchers with explicit cross-modal training do not uniformly outperform those without it. While XoFTR (trained for visible-thermal matching) and RoMa achieve the lowest reported mean error at $3.0$ px on the labeled SpaceNet9 training scenes, RoMa achieves this without any cross-modal training, and MatchAnything-ELoFTR ($3.4$ px)--trained on synthetic cross-modal pairs--comes close, suggesting (as a working hypothesis) that foundation-model features (DINOv2) may contribute modality invariance that partially substitutes for explicit cross-modal supervision. 3D-reconstruction matchers (MASt3R, DUSt3R), which are not designed for traditional 2D image matching, are highly protocol-sensitive and remain fragile under default settings. Deployment protocol choices (geometry model, tile size, inlier gating) shift accuracy by up to $33\times$ for a single matcher, sometimes exceeding the effect of swapping matchers entirely within the evaluated sweep--affine geometry alone reduces mean error from $12.34$ to $9.74$ px. These findings inform both practical deployment of existing matchers and future matcher design for cross-modal satellite registration.
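To make the affine-geometry and inlier-gating choices in the protocol concrete, the sketch below fits a 2D affine transform to matched tie points and reports a mean reprojection error in pixels. It is a simplified stand-in, assuming plain least squares rather than whatever robust estimator (e.g. RANSAC) the paper's pipeline actually uses; all function names here are illustrative:

```python
# Sketch (not the paper's code): fit x' = a*x + b*y + c, y' = d*x + e*y + f
# to tie points by least squares, then score the fit as a mean residual in
# pixels -- the kind of tie-point-grounded metric the protocol relies on.

def fit_affine(src, dst):
    """Least-squares affine fit via the normal equations A^T A p = A^T b."""
    ata = [[0.0] * 3 for _ in range(3)]
    atb_u = [0.0] * 3
    atb_v = [0.0] * 3
    for (x, y), (u, v) in zip(src, dst):
        row = (x, y, 1.0)  # design-matrix row for both output coordinates
        for i in range(3):
            for j in range(3):
                ata[i][j] += row[i] * row[j]
            atb_u[i] += row[i] * u
            atb_v[i] += row[i] * v
    return _solve3(ata, atb_u), _solve3(ata, atb_v)

def _solve3(a, b):
    """Solve a 3x3 linear system by Gauss-Jordan with partial pivoting."""
    m = [a[i][:] + [b[i]] for i in range(3)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, 4):
                    m[r][c] -= f * m[col][c]
    return [m[i][3] / m[i][i] for i in range(3)]

def mean_error(src, dst, params, gate=None):
    """Mean residual in px; optionally drop residuals above an inlier gate."""
    (a, b, c), (d, e, f) = params
    errs = []
    for (x, y), (u, v) in zip(src, dst):
        ru = a * x + b * y + c - u
        rv = d * x + e * y + f - v
        err = (ru * ru + rv * rv) ** 0.5
        if gate is None or err <= gate:
            errs.append(err)
    return sum(errs) / len(errs)
```

The `gate` parameter mirrors the inlier-gating knob in the evaluated protocol sweep: tightening it changes which matches count toward the reported error, which is one way protocol choices can shift accuracy without touching the matcher.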
Problem

Research questions and friction points this paper is trying to address.

cross-modal registration
SAR-optical matching
pretrained image matchers
satellite image registration
zero-shot transfer
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-modal registration
pretrained image matchers
SAR-optical matching
foundation model features
deployment protocol sensitivity