🤖 AI Summary
This work addresses unstable category alignment in open-vocabulary object detection for remote sensing imagery, a problem that semantic ambiguity and distribution shifts make worse when detection relies solely on textual prompts. The paper proposes RS-MPOD, a framework that adds instance-based visual prompts alongside textual prompts within a multimodal fusion architecture, departing from conventional text-only prompting. RS-MPOD combines a visual prompt encoder, a multimodal fusion module, and an open-vocabulary detection backbone, and is evaluated on standard, cross-dataset, and fine-grained remote sensing benchmarks. Visual prompting gives more reliable category specification under semantic ambiguity and distribution shifts, while multimodal prompting remains competitive when textual semantics are well aligned.
📝 Abstract
Open-vocabulary object detection in remote sensing commonly relies on text-only prompting to specify target categories, implicitly assuming that inference-time category queries can be reliably grounded through pretraining-induced text-visual alignment. In practice, this assumption often breaks down in remote sensing scenarios due to task- and application-specific category semantics, resulting in unstable category specification under open-vocabulary settings. To address this limitation, we propose RS-MPOD, a multimodal open-vocabulary detection framework that reformulates category specification beyond text-only prompting by incorporating instance-grounded visual prompts, textual prompts, and their multimodal integration. RS-MPOD introduces a visual prompt encoder to extract appearance-based category cues from exemplar instances, enabling text-free category specification, and a multimodal fusion module to integrate visual and textual information when both modalities are available. Extensive experiments on standard, cross-dataset, and fine-grained remote sensing benchmarks show that visual prompting yields more reliable category specification under semantic ambiguity and distribution shifts, while multimodal prompting provides a flexible alternative that remains competitive when textual semantics are well aligned.
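As a rough illustration of the three prompting modes the abstract describes (visual-only, text-only, and multimodal), the sketch below builds a category prompt from exemplar features and scores candidate regions against it. This is not the RS-MPOD implementation: the mean-pooled prototype, the weighted-sum fusion, the `alpha` parameter, the cosine-similarity scoring, and all function names are assumptions chosen for illustration, and the feature vectors stand in for the outputs of encoders that are not specified here.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def visual_prompt(exemplar_feats):
    """Assumed visual prompt: mean of normalized exemplar features.

    exemplar_feats: array of shape (num_exemplars, dim).
    """
    return l2_normalize(l2_normalize(exemplar_feats).mean(axis=0))


def fuse_prompts(visual=None, text=None, alpha=0.5):
    """Assumed fusion: weighted sum of whichever modalities are present.

    With only one modality, that modality alone specifies the category,
    which mirrors text-free or text-only category specification.
    """
    if visual is None and text is None:
        raise ValueError("at least one of `visual` or `text` is required")
    if text is None:
        return l2_normalize(visual)
    if visual is None:
        return l2_normalize(text)
    mixed = alpha * l2_normalize(visual) + (1.0 - alpha) * l2_normalize(text)
    return l2_normalize(mixed)


def classify_regions(region_feats, category_prompts):
    """Assign each region to the category prompt with highest cosine similarity.

    region_feats: (num_regions, dim); category_prompts: (num_categories, dim).
    Returns (labels, similarity_matrix).
    """
    sims = l2_normalize(region_feats) @ l2_normalize(category_prompts).T
    return sims.argmax(axis=1), sims


# Toy 3-D features standing in for encoder outputs (hypothetical values).
exemplars_a = np.array([[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]])
exemplars_b = np.array([[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]])
text_b = np.array([0.0, 1.0, 0.0])

prompts = np.stack([
    fuse_prompts(visual=visual_prompt(exemplars_a)),               # visual-only
    fuse_prompts(visual=visual_prompt(exemplars_b), text=text_b),  # multimodal
])
regions = np.array([[0.8, 0.1, 0.0], [0.1, 0.7, 0.05]])
labels, sims = classify_regions(regions, prompts)
```

In this toy setup, the first category is specified from exemplars alone and the second from exemplars plus a text embedding, so the same scoring step works regardless of which modalities are supplied.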