Unifying Heterogeneous Multi-Modal Remote Sensing Detection Via Language-Pivoted Pretraining

📅 2026-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the instability and limited generalization of existing heterogeneous multi-modal remote sensing object detection methods, which jointly optimize modality alignment and task-specific objectives during fine-tuning. To overcome this, we propose BabelRS, a novel framework that explicitly decouples modality alignment from downstream detection by introducing language as a semantic pivot, a first in this domain. BabelRS combines language-pivoted pretraining with two components, Concept-Shared Instruction Aligning (CSIA) and Layerwise Visual-Semantic Annealing (LVSA), to bridge the semantic gap across heterogeneous modalities, while integrating multi-scale feature aggregation for semantics-guided detection. Experiments demonstrate that BabelRS achieves state-of-the-art performance across multiple remote sensing benchmarks, delivering significantly improved detection accuracy and training stability without relying on complex engineering tricks.

📝 Abstract
Heterogeneous multi-modal remote sensing object detection aims to accurately detect objects from diverse sensors (e.g., RGB, SAR, Infrared). Existing approaches largely adopt a late alignment paradigm, in which modality alignment and task-specific optimization are entangled during downstream fine-tuning. This tight coupling complicates optimization and often results in unstable training and suboptimal generalization. To address these limitations, we propose BabelRS, a unified language-pivoted pretraining framework that explicitly decouples modality alignment from downstream task learning. BabelRS comprises two key components: Concept-Shared Instruction Aligning (CSIA) and Layerwise Visual-Semantic Annealing (LVSA). CSIA aligns each sensor modality to a shared set of linguistic concepts, using language as a semantic pivot to bridge heterogeneous visual representations. To further mitigate the granularity mismatch between high-level language representations and dense detection objectives, LVSA progressively aggregates multi-scale visual features to provide fine-grained semantic guidance. Extensive experiments demonstrate that BabelRS stabilizes training and consistently outperforms state-of-the-art methods without bells and whistles. Code: https://github.com/zcablii/SM3Det.
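The abstract describes CSIA as aligning each sensor modality to a shared set of linguistic concepts, with language serving as the pivot between heterogeneous visual representations. The paper's actual formulation is not given here; the following is a minimal illustrative sketch, assuming an InfoNCE-style contrastive objective that matches normalized per-modality features against a shared bank of text-concept embeddings (all function names, shapes, and the temperature value are assumptions, not the paper's):

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def language_pivot_loss(feats_by_modality, labels_by_modality,
                        concept_embeds, temperature=0.07):
    """InfoNCE-style alignment of per-modality visual features to a shared
    bank of text-concept embeddings (the language pivot). Each modality is
    pulled toward the same concepts, so modalities align through language
    rather than directly with each other."""
    concepts = l2_normalize(concept_embeds)            # (K, D) concept bank
    total_loss, total_count = 0.0, 0
    for modality, feats in feats_by_modality.items():
        f = l2_normalize(feats)                        # (N, D) features
        logits = f @ concepts.T / temperature          # (N, K) similarities
        labels = labels_by_modality[modality]          # (N,) concept indices
        # numerically stable cross-entropy over the concept bank
        logits = logits - logits.max(axis=1, keepdims=True)
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        total_loss += -log_probs[np.arange(len(labels)), labels].sum()
        total_count += len(labels)
    return total_loss / total_count
```

Because every modality is scored against the same concept bank, features from RGB, SAR, and infrared branches that share a label are driven toward the same text embedding, which is one plausible reading of using language as the semantic pivot.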
Problem

Research questions and friction points this paper is trying to address.

heterogeneous multi-modal
remote sensing object detection
modality alignment
downstream task learning
training stability
Innovation

Methods, ideas, or system contributions that make the work stand out.

language-pivoted pretraining
heterogeneous multi-modal detection
modality alignment
visual-semantic annealing
remote sensing object detection
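LVSA is described only at a high level: progressive aggregation of multi-scale visual features to supply fine-grained semantic guidance for dense detection. A minimal sketch of one plausible reading, assuming a top-down feature-pyramid aggregation whose blend weight anneals from coarse semantics toward fine spatial detail over training (the linear schedule and all names here are assumptions, not the paper's):

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def lvsa_topdown(pyramid, step, total_steps):
    """Progressively aggregate multi-scale features top-down.
    `pyramid` is ordered fine -> coarse: [(C,H,W), (C,H/2,W/2), ...].
    The annealing weight alpha shifts guidance from coarse semantic
    context early in training toward fine spatial detail late in it."""
    alpha = min(step / total_steps, 1.0)   # anneals 0 -> 1 over training
    out = pyramid[-1]                      # start from the coarsest level
    for feat in reversed(pyramid[:-1]):
        # alpha ~ 0: upsampled coarse semantics dominate;
        # alpha ~ 1: the fine-level features dominate
        out = alpha * feat + (1.0 - alpha) * upsample2x(out)
    return out
```

At step 0 the output is pure upsampled coarse context, and by the final step it reduces to the finest-level features, which matches the abstract's idea of gradually bridging high-level language-aligned representations and dense detection objectives.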