🤖 AI Summary
Open-vocabulary object detection (OVD) for remote sensing imagery suffers from weak cross-dataset generalization, limited text-prompt interaction, and the difficulty of combining high-precision oriented detection with real-time inference. To address these challenges, this paper proposes OpenRSD, presented as the first general-purpose remote sensing detection framework that supports multimodal prompts. Methodologically, it integrates CLIP-based vision-language alignment, an enhanced YOLO backbone, a learnable multi-directional regression head, a prompt-guided feature disentanglement module, a collaborative multi-task detection head architecture, and a progressive knowledge distillation training paradigm. Evaluated on seven public remote sensing datasets, the framework achieves an 8.7% higher average precision than YOLO-World while running at 20.8 FPS, unifying high-accuracy rotated and axis-aligned bounding box detection with real-time inference. The authors state that the code and models will be released.
📝 Abstract
Remote sensing object detection has made significant progress, but most studies still focus on closed-set detection, which limits generalization across diverse datasets. Open-vocabulary object detection (OVD) provides a solution by leveraging multimodal associations between text prompts and visual features. However, existing OVD methods for remote sensing (RS) images are constrained by small-scale datasets and fail to address the unique challenges of remote sensing interpretation, including oriented object detection and the need for both high precision and real-time performance in diverse scenarios. To tackle these challenges, we propose OpenRSD, a universal open-prompt RS object detection framework. OpenRSD supports multimodal prompts and integrates multi-task detection heads to balance accuracy and real-time requirements. Additionally, we design a multi-stage training pipeline to enhance the model's generalization. Evaluated on seven public datasets, OpenRSD demonstrates superior performance in oriented and horizontal bounding box detection, with real-time inference capabilities suitable for large-scale RS image analysis. Compared to YOLO-World, OpenRSD achieves an 8.7% higher average precision at an inference speed of 20.8 FPS. Code and models will be released.
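To make the idea of associating text prompts with visual features concrete, the following is a minimal sketch of open-vocabulary classification by embedding similarity. It is not OpenRSD's implementation: the function names `cosine_similarity` and `classify_region` are hypothetical, and the 3-dimensional embeddings are made-up stand-ins for the vectors that text and image encoders would produce.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_region(region_embedding, prompt_embeddings):
    """Return the (label, score) of the prompt most similar to the region.

    Hypothetical helper: `prompt_embeddings` maps a text prompt to its
    embedding. In a real system these would come from a text encoder.
    """
    scores = {
        label: cosine_similarity(region_embedding, embedding)
        for label, embedding in prompt_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores[best_label]


# Toy embeddings, invented for illustration only.
prompt_embeddings = {
    "airplane": [0.9, 0.1, 0.0],
    "ship": [0.1, 0.9, 0.1],
    "storage tank": [0.0, 0.2, 0.9],
}

# Toy visual feature for one detected region.
region_embedding = [0.2, 0.8, 0.1]

label, score = classify_region(region_embedding, prompt_embeddings)
print(label, round(score, 3))
```

Because the category set is just a dictionary of prompt embeddings, a new category can be added at inference time by embedding a new prompt, which is what distinguishes this open-vocabulary setup from a closed-set classifier with a fixed output layer.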