🤖 AI Summary
Optical remote sensing images are often degraded by cloud and haze, hindering accurate identification of land cover classes such as water bodies. To address this limitation, this work pioneers the integration of synthetic aperture radar (SAR) imagery into remote sensing visual question answering (RSVQA). We introduce the first benchmark dataset enabling joint SAR–optical multimodal reasoning and study two approaches: an end-to-end model that adds a dedicated SAR encoder, and a two-stage, language-driven framework in which relevant information is first extracted from SAR (and optionally optical) data, translated into natural language, and then passed to a language model that produces the answer. Experiments show that SAR-only inputs yield competitive performance under the two-stage pipeline, while decision-level SAR–optical fusion outperforms unimodal baselines on the proposed dataset, with the largest accuracy gains on water-body-related questions. This validates SAR’s robustness and complementary information gain under adverse weather conditions.
📝 Abstract
Remote sensing visual question answering (RSVQA) is the task of automatically extracting information from satellite images to answer a natural-language question about them in textual form, aiding image interpretation. While various methods have been proposed to extract information from optical images with different spectral bands and resolutions, none has addressed answering questions from Synthetic Aperture Radar (SAR) images. SAR images capture electromagnetic backscatter from the scene and are less affected by atmospheric conditions, such as clouds. In this work, our objective is to introduce SAR into the RSVQA task and determine the best way to exploit this modality. We carry out a study of different pipelines for RSVQA that take into account information from both SAR and optical data. To this end, we also present a dataset that allows for the introduction of SAR images in the RSVQA framework. We propose two models to include the SAR modality. The first is an end-to-end method in which we add an additional encoder for the SAR modality. The second builds on a two-stage framework: relevant information is first extracted from SAR and, optionally, optical data; this information is then translated into natural language and used in the second stage, which relies solely on a language model to provide the answer. We find that the second pipeline yields good results with SAR images alone. We then evaluate several fusion strategies for combining SAR and optical images, finding that fusion at the decision level achieves the best results on the proposed dataset. We show that SAR data offers complementary information when fused with the optical modality, particularly for questions related to specific land cover classes, such as water areas.
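The decision-level fusion the abstract reports as best can be illustrated with a minimal sketch: each unimodal model (SAR-only, optical-only) produces a distribution over candidate answers, and the fused prediction is taken from a weighted combination of the two. This is not the paper's code; the function and variable names, the weighting scheme, and the toy probabilities below are all illustrative assumptions.

```python
# Illustrative sketch (not the paper's implementation) of decision-level
# fusion for RSVQA: combine answer distributions from a SAR-only model
# and an optical-only model, then pick the highest-scoring answer.

def fuse_decisions(p_sar, p_opt, w_sar=0.5):
    """Weighted average of two answer->probability dicts; returns the argmax."""
    answers = set(p_sar) | set(p_opt)
    fused = {
        a: w_sar * p_sar.get(a, 0.0) + (1.0 - w_sar) * p_opt.get(a, 0.0)
        for a in answers
    }
    return max(fused, key=fused.get)

# Toy example: the optical model is unsure (e.g. clouds over a water area)
# while the SAR model is confident; fusion follows the confident modality.
p_sar = {"yes": 0.9, "no": 0.1}   # hypothetical SAR-only output
p_opt = {"yes": 0.4, "no": 0.6}   # hypothetical optical-only output
print(fuse_decisions(p_sar, p_opt))  # -> yes
```

The appeal of fusing at the decision level, as opposed to feature level, is that each modality can be encoded and trained independently and only their answer distributions need to be reconciled, which matches the two-stage, language-mediated design described above.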