🤖 AI Summary
To address the challenge of modeling fine-grained semantics for ship detection in complex remote sensing scenes, this paper proposes ShipSem, a semantic-aware ship detection framework. ShipSem integrates vision-language models (VLMs) and introduces a multi-scale adaptive sliding-window strategy to achieve cross-modal semantic alignment between image regions and textual descriptions. The authors further construct ShipSem-VL, the first vision-language dataset tailored for fine-grained ship attribute recognition, supporting attribute-level understanding of properties such as ship type, heading, and operational state. Extensive experiments across three tasks demonstrate that ShipSem significantly improves detection accuracy and semantic expressiveness, and exhibits enhanced robustness against complex backgrounds, small objects, and high-density configurations. By advancing ship detection from geometric localization toward semantic comprehension, ShipSem establishes a novel paradigm for intelligent maritime monitoring.
📝 Abstract
Ship detection in remote sensing imagery is a critical task with wide-ranging applications, such as maritime activity monitoring, shipping logistics, and environmental studies. However, existing methods often struggle to capture fine-grained semantic information, limiting their effectiveness in complex scenarios. To address these challenges, we propose a novel detection framework that combines Vision-Language Models (VLMs) with a multi-scale adaptive sliding-window strategy. To facilitate Semantic-Aware Ship Detection (SASD), we introduce ShipSem-VL, a specialized vision-language dataset designed to capture fine-grained ship attributes. We evaluate our framework on three well-defined tasks, providing a comprehensive analysis of its performance and demonstrating its effectiveness in advancing SASD from multiple perspectives.