OpenRSD: Towards Open-prompts for Object Detection in Remote Sensing Images

📅 2025-03-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address weak cross-dataset generalization, limited text-prompt interaction, and the difficulty of combining high-precision oriented detection with real-time inference in open-vocabulary object detection (OVD) for remote sensing imagery, this paper proposes the first multimodal text-prompt-enabled general detection framework for remote sensing. The method integrates CLIP-based vision-language alignment, an enhanced YOLO backbone, a learnable multi-directional regression head, a prompt-guided feature disentanglement module, a collaborative multi-task detection head architecture, and a progressive knowledge distillation training paradigm. Evaluated on seven public remote sensing datasets, the framework achieves 8.7% higher average precision than YOLO-World and runs at 20.8 FPS, marking the first solution to unify high-accuracy rotated and axis-aligned bounding box detection with real-time inference. According to the abstract, code and models will be released.

📝 Abstract

Remote sensing object detection has made significant progress, but most studies still focus on closed-set detection, limiting generalization across diverse datasets. Open-vocabulary object detection (OVD) provides a solution by leveraging multimodal associations between text prompts and visual features. However, existing OVD methods for remote sensing (RS) images are constrained by small-scale datasets and fail to address the unique challenges of remote sensing interpretation, including oriented object detection and the need for both high precision and real-time performance in diverse scenarios. To tackle these challenges, we propose OpenRSD, a universal open-prompt RS object detection framework. OpenRSD supports multimodal prompts and integrates multi-task detection heads to balance accuracy and real-time requirements. Additionally, we design a multi-stage training pipeline to enhance the generalization of the model. Evaluated on seven public datasets, OpenRSD demonstrates superior performance in oriented and horizontal bounding box detection, with real-time inference capabilities suitable for large-scale RS image analysis. Compared to YOLO-World, OpenRSD exhibits an 8.7% higher average precision and achieves an inference speed of 20.8 FPS. Code and models will be released.
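
The abstract's phrase "multimodal associations between text prompts and visual features" can be illustrated with a minimal sketch. This is not OpenRSD's implementation: it assumes that each candidate region and each text prompt has already been mapped to a vector in a shared embedding space, and it labels a region with the prompt of highest cosine similarity. The function names and the three-dimensional toy vectors are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_scores(region_feature, prompt_embeddings):
    """Return {label: cosine similarity} between one region and every prompt."""
    region = l2_normalize(region_feature)
    scores = {}
    for label, embedding in prompt_embeddings.items():
        prompt = l2_normalize(embedding)
        scores[label] = sum(r * p for r, p in zip(region, prompt))
    return scores


def classify_region(region_feature, prompt_embeddings):
    """Pick the prompt whose embedding is closest to the region feature."""
    scores = cosine_scores(region_feature, prompt_embeddings)
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy embeddings (hypothetical values, standing in for a real text encoder).
PROMPTS = {
    "airplane": [1.0, 0.0, 0.0],
    "ship": [0.0, 1.0, 0.0],
    "storage tank": [0.0, 0.0, 1.0],
}
REGION = [0.1, 0.9, 0.2]  # hypothetical visual feature of one detected region

label, score = classify_region(REGION, PROMPTS)
print(label, round(score, 3))
```

Because the category set lives in the prompt dictionary rather than in a fixed classifier layer, adding a new entry to `PROMPTS` extends the vocabulary without retraining, which is the property that distinguishes open-vocabulary detection from closed-set detection.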

Problem

Research questions and friction points this paper is trying to address.

Addresses limitations of closed-set detection in remote sensing.

Tackles challenges specific to open-vocabulary object detection in RS images, such as small-scale datasets and oriented objects.

Balances accuracy and real-time performance in diverse scenarios.
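
To put the reported 20.8 FPS in perspective, the sketch below converts it to per-image latency and estimates how long a tiled pass over a large scene would take. The scene size, tile size, and overlap are hypothetical values chosen for illustration, and the estimate assumes that the reported speed applies to each tile, which the summary does not state.

```python
import math

REPORTED_FPS = 20.8  # inference speed quoted in the abstract


def latency_ms(fps):
    """Average time per image in milliseconds at a given throughput."""
    return 1000.0 / fps


def tiles_per_axis(size, tile, overlap):
    """Number of sliding-window tiles needed to cover one image axis."""
    if size <= tile:
        return 1
    stride = tile - overlap
    return math.ceil((size - tile) / stride) + 1


def scene_seconds(width, height, tile, overlap, fps):
    """Estimated seconds to process every tile of a scene at the given FPS."""
    n_tiles = tiles_per_axis(width, tile, overlap) * tiles_per_axis(height, tile, overlap)
    return n_tiles, n_tiles / fps


# Hypothetical scene: 10000 x 10000 pixels, 1024-pixel tiles, 200-pixel overlap.
n_tiles, seconds = scene_seconds(10000, 10000, 1024, 200, REPORTED_FPS)
print(f"{latency_ms(REPORTED_FPS):.1f} ms per image")
print(f"{n_tiles} tiles, about {seconds:.1f} s per scene")
```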

Innovation

Methods, ideas, or system contributions that make the work stand out.

OpenRSD integrates multimodal prompts for object detection.

Multi-task detection heads balance accuracy and real-time performance.

Multi-stage training enhances model generalization across datasets.
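
The abstract distinguishes oriented from horizontal bounding boxes. As a rough illustration of how the two formats relate, the sketch below turns an oriented box into its four corners and then into the smallest enclosing horizontal box. The `(cx, cy, w, h, angle)` parameterization with the angle in radians is an assumption; remote sensing datasets use several conventions, and this is not taken from the OpenRSD code.

```python
import math


def obb_to_corners(cx, cy, w, h, angle_rad):
    """Return the four corner points of a box rotated about its center."""
    cos_a, sin_a = math.cos(angle_rad), math.sin(angle_rad)
    half_offsets = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    # Rotate each corner offset, then translate it to the box center.
    return [
        (cx + dx * cos_a - dy * sin_a, cy + dx * sin_a + dy * cos_a)
        for dx, dy in half_offsets
    ]


def corners_to_hbb(corners):
    """Return the axis-aligned box (x_min, y_min, x_max, y_max) enclosing the corners."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return (min(xs), min(ys), max(xs), max(ys))


# A 4 x 2 box centered at (10, 10), first unrotated, then rotated by 90 degrees.
flat = corners_to_hbb(obb_to_corners(10, 10, 4, 2, 0.0))
turned = corners_to_hbb(obb_to_corners(10, 10, 4, 2, math.pi / 2))
print(flat)
print(turned)
```

For elongated objects at oblique angles, such as ships or bridges, the enclosing horizontal box covers much more background than the oriented box, which is one reason oriented detection matters in this domain.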

🔎 Similar Papers

Locate Anything on Earth: Advancing Open-Vocabulary Object Detection for Remote Sensing Community