🤖 AI Summary
Dense small-object segmentation in remote sensing imagery remains challenging due to severe annotation costs and inherent difficulties in delineating tiny, closely spaced instances.
Method: This paper introduces the first point-supervised fine-tuning framework for Segment Anything Model (SAM), eliminating the need for pixel-level annotations. We propose a prototype-matching–driven self-training paradigm to suppress pseudo-label noise and a negative-prompt calibration mechanism leveraging non-overlapping priors to mitigate erroneous merging of small objects. The method synergistically integrates SAM’s zero-shot generalization, Hungarian algorithm–based instance matching, self-generated pseudo-labels, and negative-prompt–guided mask refinement.
Contribution/Results: Our approach achieves significant performance gains over SAM and SAM2 on the WHU, HRSID, and NWPU VHR-10 benchmarks. Moreover, used as a point-to-box converter, it demonstrates strong cross-task generalization in downstream applications. This work removes the reliance on full supervision, establishing a low-cost paradigm for remote sensing image segmentation.
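The prototype-matching step described above can be illustrated with a minimal sketch: prediction prototypes are aligned to target prototypes by running the Hungarian algorithm on a cosine-distance cost matrix, so that pseudo-labels are only trusted for well-matched pairs. The function and variable names below are illustrative, not the paper's actual API.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_prototypes(target_protos, pred_protos):
    """Match prediction prototypes to target prototypes with the
    Hungarian algorithm on a cosine-distance cost matrix.
    Inputs are (N, D) arrays of feature prototypes."""
    t = target_protos / np.linalg.norm(target_protos, axis=1, keepdims=True)
    p = pred_protos / np.linalg.norm(pred_protos, axis=1, keepdims=True)
    cost = 1.0 - t @ p.T                      # cosine distance, shape (T, P)
    rows, cols = linear_sum_assignment(cost)  # minimal-cost one-to-one matching
    return list(zip(rows, cols)), cost[rows, cols]

# toy example: the same 3 prototypes, presented to the model in permuted order
rng = np.random.default_rng(0)
protos = rng.normal(size=(3, 8))
perm = [2, 0, 1]
pairs, costs = match_prototypes(protos, protos[perm])
# each target prototype is recovered at its permuted position, with ~0 cost
```

In the actual framework the matching cost would then gate which pseudo-labels are kept for self-training; here only the assignment step is shown.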
📝 Abstract
Segment Anything Model (SAM) is an advanced foundational model for image segmentation, which is gradually being applied to remote sensing images (RSIs). Due to the domain gap between RSIs and natural images, traditional methods typically use SAM as a source pre-trained model and fine-tune it with fully supervised masks. Unlike these methods, our work focuses on fine-tuning SAM using more convenient and challenging point annotations. Leveraging SAM's zero-shot capabilities, we adopt a self-training framework that iteratively generates pseudo-labels for training. However, if the pseudo-labels are noisy, there is a risk of error accumulation. To address this issue, we extract target prototypes from the target dataset and use the Hungarian algorithm to match them with prediction prototypes, preventing the model from learning in the wrong direction. Additionally, due to the complex backgrounds and dense distribution of objects in RSIs, using point prompts may result in multiple objects being recognized as one. To solve this problem, we propose a negative prompt calibration method based on the non-overlapping nature of instance masks. In brief, we use the prompts of overlapping masks as the corresponding negative signals, yielding refined masks. Combining the above methods, we propose a novel pointly-supervised Segment Anything Model named PointSAM. We conduct experiments on RSI datasets, including WHU, HRSID, and NWPU VHR-10, and the results show that our method significantly outperforms direct testing with SAM, SAM2, and other comparison methods. Furthermore, we introduce PointSAM as a point-to-box converter and achieve encouraging results, suggesting that this method can be extended to other point-supervised tasks. The code is available at https://github.com/Lans1ng/PointSAM.
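The negative prompt calibration described in the abstract can be sketched as follows: since instance masks should not overlap, whenever two predicted masks intersect, each instance's positive click is reused as a negative point (label 0) in the other instance's prompt before re-prompting the model. This is a simplified illustration of the idea; the function, names, and the overlap test are assumptions, not the paper's exact implementation.

```python
import numpy as np

def calibrate_prompts(masks, points):
    """Build calibrated point prompts from predicted instance masks.

    masks  : list of boolean (H, W) arrays, one per instance
    points : list of (row, col) positive clicks, one per instance
    Returns per-instance (point_coords, point_labels) pairs, where
    label 1 marks a positive point and label 0 a negative point.
    """
    n = len(masks)
    prompts = []
    for i in range(n):
        pts, labels = [points[i]], [1]            # own click stays positive
        for j in range(n):
            # overlapping mask -> the other instance's click becomes negative
            if j != i and np.logical_and(masks[i], masks[j]).any():
                pts.append(points[j])
                labels.append(0)
        prompts.append((np.array(pts), np.array(labels)))
    return prompts

# toy 6x6 scene: masks 0 and 1 overlap (shared row 2), mask 2 is separate
m0 = np.zeros((6, 6), bool); m0[0:3, 0:3] = True
m1 = np.zeros((6, 6), bool); m1[2:5, 0:3] = True
m2 = np.zeros((6, 6), bool); m2[0:3, 4:6] = True
prompts = calibrate_prompts([m0, m1, m2], [(1, 1), (4, 1), (1, 5)])
```

In a full pipeline, each `(point_coords, point_labels)` pair would be fed back to SAM's prompt encoder to produce a refined, non-overlapping mask; only the prompt construction is shown here.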