Visual Question Answering on Multiple Remote Sensing Image Modalities

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses insufficient cross-modal semantic understanding in remote sensing multimodal visual question answering (VQA). We propose a novel paradigm integrating high-resolution RGB, multispectral, and synthetic aperture radar (SAR) imagery for VQA. Our MM-RSVQA model extends VisualBERT with dedicated multimodal feature alignment and learnable adaptive fusion modules to handle heterogeneous, multi-resolution inputs. We further develop an automated pipeline for generating large-scale, semantically rich remote sensing VQA data. To support benchmarking and reproducibility, we introduce TAMMI—the first extensible multimodal remote sensing VQA benchmark dataset. Evaluated on TAMMI, MM-RSVQA achieves 65.56% accuracy, substantially outperforming unimodal baselines. Both the codebase and TAMMI are publicly released, establishing foundational resources and a new methodological framework for multimodal VQA in remote sensing and other domains such as medical imaging.
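The automated data-generation pipeline mentioned above can be pictured as template-based question instantiation over scene annotations. The sketch below is a hypothetical illustration of that idea only; the attribute names, templates, and answer vocabulary are assumptions, not the authors' pipeline.

```python
# Hypothetical sketch of automated VQA-pair generation from scene metadata.
# Questions are instantiated from simple templates over per-class object
# counts; all names and templates here are illustrative assumptions.

def generate_qa(scene):
    """Yield (question, answer) pairs from a scene's per-class object counts."""
    pairs = []
    for cls, count in scene["objects"].items():
        # Presence question: answered "yes"/"no" from the count.
        pairs.append((f"Is there a {cls} in the scene?",
                      "yes" if count > 0 else "no"))
        # Counting question: answered with the count itself.
        pairs.append((f"How many {cls}s are present?", str(count)))
    return pairs

# Toy annotated scene (illustrative).
scene = {"objects": {"building": 4, "road": 0}}
qa = generate_qa(scene)
```

Because generation is purely rule-driven, the dataset can be regrown or extended whenever new annotated scenes or templates are added, which matches the extensibility claim made for TAMMI.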

📝 Abstract
The extraction of visual features is an essential step in Visual Question Answering (VQA). Building a good visual representation of the analyzed scene is one of the keys to the system understanding it correctly and answering complex questions. In many fields, such as remote sensing, the visual feature extraction step could benefit significantly from leveraging different image modalities carrying complementary spectral, spatial and contextual information. In this work, we propose to add multiple image modalities to VQA in the particular context of remote sensing, leading to a novel task for the computer vision community. To this end, we introduce a new VQA dataset, named TAMMI (Text and Multi-Modal Imagery), with diverse questions on scenes described by three different modalities (very high resolution RGB, multi-spectral imaging data and synthetic aperture radar). Thanks to an automated pipeline, this dataset can easily be extended according to experimental needs. We also propose the MM-RSVQA (Multi-modal Multi-resolution Remote Sensing Visual Question Answering) model, based on VisualBERT, a vision-language transformer, to effectively combine the multiple image modalities and text through a trainable fusion process. A preliminary experimental study shows promising results for our methodology on this challenging dataset, with an accuracy of 65.56% on the targeted VQA task. This pioneering work paves the way for a new multi-modal multi-resolution VQA task that can be applied in other imaging domains (such as medical imaging) where multi-modality can enrich the visual representation of a scene. The dataset and code are available at https://tammi.sylvainlobry.com/.
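The trainable fusion process described in the abstract can be sketched as follows: each modality's features are projected to a shared dimension and combined with learnable gate weights before entering the vision-language transformer. This is a minimal illustration under assumed dimensions and a softmax gating scheme, not the authors' MM-RSVQA implementation.

```python
import numpy as np

# Minimal sketch of a learnable adaptive fusion step: RGB, multi-spectral and
# SAR features of different native sizes are projected to a shared dimension
# and fused with softmax gate weights that would be trained jointly with the
# VQA objective. All dimensions and the gating scheme are assumptions.

rng = np.random.default_rng(0)
DIM = 768  # shared embedding size (VisualBERT-style hidden dimension)

def adaptive_fusion(feats, proj_weights, gate_logits):
    """Project each modality and fuse with softmax gate weights."""
    projected = [f @ w for f, w in zip(feats, proj_weights)]
    gates = np.exp(gate_logits - gate_logits.max())
    gates = gates / gates.sum()  # softmax over the three modalities
    fused = sum(g * p for g, p in zip(gates, projected))
    return fused, gates

# Toy per-modality feature vectors with different native sizes (assumed:
# RGB, multi-spectral, SAR extractors respectively).
dims = [2048, 512, 256]
feats = [rng.standard_normal(d) for d in dims]
proj_weights = [rng.standard_normal((d, DIM)) * 0.01 for d in dims]
gate_logits = np.zeros(3)  # learnable; equal weights before any training

fused, gates = adaptive_fusion(feats, proj_weights, gate_logits)
```

Because the gate logits are parameters, the model can learn to down-weight a modality that is uninformative for a given scene, which is the intuition behind combining heterogeneous, multi-resolution inputs through a trainable fusion rather than simple concatenation.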
Problem

Research questions and friction points this paper is trying to address.

Enhancing VQA with multi-modal remote sensing imagery
Integrating spectral, spatial, and contextual data for scene understanding
Developing a fusion model for multi-resolution image-text analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging multiple remote sensing image modalities
Introducing TAMMI dataset with diverse questions
Proposing MM-RSVQA model based on VisualBERT
Hichem Boussaid
LIPADE, Université Paris Cité, France
Lucrezia Tosato
ONERA, France
Flora Weissgerber
ONERA, France
Camille Kurtz
LIPADE, Université Paris Cité, France
Laurent Wendling
Professor of Computer Science, Université Paris Cité
Pattern recognition
Sylvain Lobry
Associate professor, Université Paris Cité
Remote sensing · Visual Question Answering · Deep learning · Image processing · SAR