Unifying Heterogeneous Multi-Modal Remote Sensing Detection Via Language-Pivoted Pretraining

📅 2026-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the instability and limited generalization of existing heterogeneous multi-modal remote sensing object detection methods, which jointly optimize modality alignment and task-specific objectives during fine-tuning. To overcome this, we propose BabelRS, a novel framework that explicitly decouples modality alignment from downstream detection by introducing language as a semantic pivot, a first in this domain. BabelRS combines language-pivoted pretraining with two components, Concept-Shared Instruction Aligning (CSIA) and Layerwise Visual-Semantic Annealing (LVSA), to bridge the semantic gap across heterogeneous modalities, while integrating multi-scale feature aggregation for semantics-guided detection. Experiments demonstrate that BabelRS achieves state-of-the-art performance across multiple remote sensing benchmarks, delivering significantly improved detection accuracy and training stability without relying on complex engineering tricks.

📝 Abstract
Heterogeneous multi-modal remote sensing object detection aims to accurately detect objects from diverse sensors (e.g., RGB, SAR, Infrared). Existing approaches largely adopt a late alignment paradigm, in which modality alignment and task-specific optimization are entangled during downstream fine-tuning. This tight coupling complicates optimization and often results in unstable training and suboptimal generalization. To address these limitations, we propose BabelRS, a unified language-pivoted pretraining framework that explicitly decouples modality alignment from downstream task learning. BabelRS comprises two key components: Concept-Shared Instruction Aligning (CSIA) and Layerwise Visual-Semantic Annealing (LVSA). CSIA aligns each sensor modality to a shared set of linguistic concepts, using language as a semantic pivot to bridge heterogeneous visual representations. To further mitigate the granularity mismatch between high-level language representations and dense detection objectives, LVSA progressively aggregates multi-scale visual features to provide fine-grained semantic guidance. Extensive experiments demonstrate that BabelRS stabilizes training and consistently outperforms state-of-the-art methods without bells and whistles. Code: https://github.com/zcablii/SM3Det.
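The abstract describes CSIA as aligning each sensor modality to a shared set of linguistic concepts, with language serving as the pivot between heterogeneous visual representations. The paper's actual formulation is not given here; the following is a minimal illustrative sketch, assuming an InfoNCE-style contrastive objective that matches normalized per-modality features against a shared bank of text-concept embeddings (all function names, shapes, and the temperature value are assumptions, not the paper's):

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def language_pivot_loss(feats_by_modality, labels_by_modality,
                        concept_embeds, temperature=0.07):
    """InfoNCE-style alignment of per-modality visual features to a shared
    bank of text-concept embeddings (the language pivot). Each modality is
    pulled toward the same concepts, so modalities align through language
    rather than directly with each other."""
    concepts = l2_normalize(concept_embeds)            # (K, D) concept bank
    total_loss, total_count = 0.0, 0
    for modality, feats in feats_by_modality.items():
        f = l2_normalize(feats)                        # (N, D) features
        logits = f @ concepts.T / temperature          # (N, K) similarities
        labels = labels_by_modality[modality]          # (N,) concept indices
        # numerically stable cross-entropy over the concept bank
        logits = logits - logits.max(axis=1, keepdims=True)
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        total_loss += -log_probs[np.arange(len(labels)), labels].sum()
        total_count += len(labels)
    return total_loss / total_count
```

Because every modality is scored against the same concept bank, features from RGB, SAR, and infrared branches that share a label are driven toward the same text embedding, which is one plausible reading of using language as the semantic pivot.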
Problem

Research questions and friction points this paper is trying to address.

heterogeneous multi-modal
remote sensing object detection
modality alignment
downstream task learning
training stability
Innovation

Methods, ideas, or system contributions that make the work stand out.

language-pivoted pretraining
heterogeneous multi-modal detection
modality alignment
visual-semantic annealing
remote sensing object detection
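LVSA is described only at a high level: progressive aggregation of multi-scale visual features to supply fine-grained semantic guidance for dense detection. A minimal sketch of one plausible reading, assuming a top-down feature-pyramid aggregation whose blend weight anneals from coarse semantics toward fine spatial detail over training (the linear schedule and all names here are assumptions, not the paper's):

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def lvsa_topdown(pyramid, step, total_steps):
    """Progressively aggregate multi-scale features top-down.
    `pyramid` is ordered fine -> coarse: [(C,H,W), (C,H/2,W/2), ...].
    The annealing weight alpha shifts guidance from coarse semantic
    context early in training toward fine spatial detail late in it."""
    alpha = min(step / total_steps, 1.0)   # anneals 0 -> 1 over training
    out = pyramid[-1]                      # start from the coarsest level
    for feat in reversed(pyramid[:-1]):
        # alpha ~ 0: upsampled coarse semantics dominate;
        # alpha ~ 1: the fine-level features dominate
        out = alpha * feat + (1.0 - alpha) * upsample2x(out)
    return out
```

At step 0 the output is pure upsampled coarse context, and by the final step it reduces to the finest-level features, which matches the abstract's idea of gradually bridging high-level language-aligned representations and dense detection objectives.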