RemoteShield: Enable Robust Multimodal Large Language Models for Earth Observation

📅 2026-04-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the degradation in visual-semantic reasoning that existing remote sensing multimodal large language models suffer under real-world image degradations, such as cloud or haze occlusion, and noisy textual instructions, such as ambiguous or incomplete requests. The authors first evaluate model vulnerability under a set of realistic multimodal perturbations, then propose RemoteShield, which groups each clean sample with its perturbed variants into a semantic-equivalence cluster. A cross-condition alignment framework based on preference learning guides the model toward consistent responses across the conditions within a cluster. Experiments on three Earth observation tasks show that RemoteShield is more robust and more consistent across conditions than representative baselines under noise in both modalities.
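A minimal Python sketch of how a semantic-equivalence cluster might be assembled from one clean sample follows. The `Sample` fields, the condition names, and the two text perturbation functions are hypothetical stand-ins, not the authors' pipeline, and image degradations are represented only by a label rather than actually rendered.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Sample:
    image_id: str         # hypothetical reference to a remote sensing image
    image_condition: str  # "clean", "cloud", "haze", ...
    instruction: str      # the textual query given to the model
    answer: str           # ground truth shared by every member of the cluster


def identity(instruction: str) -> str:
    """Leave the instruction untouched (image-only perturbation)."""
    return instruction


def colloquial(instruction: str) -> str:
    """Toy stand-in for a colloquial rephrasing (assumption)."""
    return "hey, quick one: " + instruction.lower()


def vague(instruction: str) -> str:
    """Toy stand-in for an underspecified request: keep the first half of the words."""
    words = instruction.split()
    return " ".join(words[: max(1, len(words) // 2)])


def build_cluster(
    clean: Sample,
    image_conditions: Sequence[str],
    text_perturbations: Sequence[Callable[[str], str]],
) -> List[Sample]:
    """Return the clean sample followed by every image/text perturbed variant."""
    cluster = [clean]
    for condition in ["clean", *image_conditions]:
        for perturb in [identity, *text_perturbations]:
            if condition == "clean" and perturb is identity:
                continue  # this combination is the clean sample itself
            cluster.append(
                Sample(clean.image_id, condition, perturb(clean.instruction), clean.answer)
            )
    return cluster


clean = Sample("scene_001", "clean", "How many ships are docked in the harbor?", "3")
cluster = build_cluster(clean, ["cloud", "haze"], [colloquial, vague])
for member in cluster:
    print(member.image_condition, "|", member.instruction)
```

Every member keeps the same `answer`, which is what makes the group semantically equivalent: only the surface form of the image and the instruction changes.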

📝 Abstract

A robust Multimodal Large Language Model (MLLM) for Earth Observation should maintain consistent interpretation and reasoning under realistic input variations. However, current Remote Sensing MLLMs fail to meet this requirement. Trained on carefully curated clean datasets, they learn brittle mappings that do not generalize to noisy conditions in operational Earth Observation. Consequently, their performance degrades when confronted with imperfect inputs in deployment. To quantify this vulnerability, we construct a realistic set of multimodal perturbations, including visual degradations such as cloud and fog cover, together with diverse human-centric textual variations ranging from colloquialisms to vague or omitted instructions. Empirical evaluations show that these perturbations significantly impair the visual-semantic reasoning capabilities of leading RS foundation models. To address this limitation, we introduce RemoteShield, a robust Remote Sensing MLLM trained to maintain consistent outputs across realistic input variations. During training, each clean sample is paired with its image-text perturbed variants to form a semantic equivalence cluster. Rather than directly fitting noisy samples, RemoteShield is optimized through preference learning over clean and perturbed conditions within the same cluster. By comparing model responses to clean and corrupted inputs, the model is encouraged to favor stable responses over perturbation-induced failures. This cross-condition alignment helps the model focus on underlying task semantics despite visual degradations and textual noise. Experiments on three Earth Observation tasks show that RemoteShield consistently delivers stronger robustness and cross-condition consistency than representative baselines under realistic multimodal perturbations.
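The preference step described above can be sketched as follows. The abstract does not name the training objective, so this sketch assumes a DPO-style loss as one common choice; `make_pairs`, `dpo_loss`, the `beta` value, and the example log-probabilities are illustrative assumptions rather than the authors' implementation.

```python
import math
from typing import List, Tuple

# One model response within a cluster: (input condition, response text, is_stable)
Response = Tuple[str, str, bool]


def make_pairs(responses: List[Response]) -> List[Tuple[str, str, str]]:
    """Pair each perturbation-induced failure with a stable response.

    Returns (condition of the failing input, chosen response, rejected response).
    """
    stable = [r for r in responses if r[2]]
    failed = [r for r in responses if not r[2]]
    if not stable:
        return []  # nothing to prefer if every condition failed
    chosen_text = stable[0][1]
    return [(condition, chosen_text, text) for condition, text, _ in failed]


def dpo_loss(
    policy_chosen: float,
    policy_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO-style loss on sequence log-probabilities (assumed objective).

    Computes -log(sigmoid(margin)) in a numerically stable softplus form.
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


responses = [
    ("clean", "3 ships", True),
    ("cloud", "no ships visible", False),
    ("vague text", "3 ships", True),
]
pairs = make_pairs(responses)
print(pairs)

# Hypothetical log-probabilities: the policy already prefers the stable answer.
loss_preferring_stable = dpo_loss(-2.0, -8.0, -5.0, -5.0)
loss_indifferent = dpo_loss(-5.0, -5.0, -5.0, -5.0)
print(loss_preferring_stable < loss_indifferent)
```

The loss shrinks as the model assigns relatively more probability to the stable response than to the failure on the same perturbed input, which is the behavior the abstract attributes to cross-condition alignment.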

Problem

Research questions and friction points this paper is trying to address.

Robustness

Multimodal Large Language Models

Earth Observation

Input Perturbations

Visual-Semantic Reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Large Language Model

Robustness

Preference Learning