🤖 AI Summary
Weak interpretability and susceptibility to dataset distribution bias, which leads to spurious shortcut learning, are critical challenges in remote sensing visual question answering (RSVQA). To address these issues, this paper proposes a dual-track solution: (1) introducing *Chessboard*, the first highly balanced, low-bias, fine-grained RSVQA dataset, which explicitly mitigates answer distribution skew and scene-correlation bias; and (2) designing *Checkmate*, an interpretable model that integrates image patch-level visual grounding with a multi-model collaborative verification architecture to ensure traceable and verifiable decision rationales. Extensive experiments demonstrate consistent improvements across mainstream quantized RSVQA models: +3.2–5.7% in inference accuracy and +12.4% in localization accuracy, the latter serving as a quantitative proxy for transparency. Collectively, this work establishes a dual-paradigm foundation for trustworthy RSVQA systems, advancing both data curation and model design toward robust, explainable remote sensing intelligence.
📝 Abstract
Remote Sensing Visual Question Answering (RSVQA) presents unique challenges in ensuring that model decisions are both understandable and grounded in visual content. Current models often suffer from a lack of interpretability and explainability, as well as from biases in dataset distributions that lead to shortcut learning. In this work, we tackle these issues by introducing a novel RSVQA dataset, Chessboard, designed to minimize biases through its 3,123,253 questions and balanced answer distribution. Each answer is linked to one or more cells within the image, enabling fine-grained visual reasoning.
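
The abstract describes answers that are grounded in one or more cells of the image grid. As a rough illustration only, the sketch below shows one way such a cell-grounded sample could be represented; the field names, grid size, and pixel-mapping helper are assumptions for exposition, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical record layout for a cell-grounded RSVQA sample.
# Field names and grid geometry are illustrative assumptions,
# not the Chessboard dataset's actual schema.
@dataclass
class ChessboardSample:
    image_id: str
    question: str
    answer: str
    grid_size: Tuple[int, int]           # e.g. (8, 8) cells over the image
    answer_cells: List[Tuple[int, int]]  # (row, col) cells supporting the answer

    def cell_to_pixels(self, row: int, col: int, img_h: int, img_w: int):
        """Map a grid cell to its pixel bounding box (x0, y0, x1, y1)."""
        cell_h, cell_w = img_h / self.grid_size[0], img_w / self.grid_size[1]
        return (col * cell_w, row * cell_h, (col + 1) * cell_w, (row + 1) * cell_h)

sample = ChessboardSample(
    image_id="tile_0421",
    question="Is there a road in the top-left quadrant?",
    answer="yes",
    grid_size=(8, 8),
    answer_cells=[(0, 1), (1, 1)],
)
print(sample.cell_to_pixels(0, 1, img_h=512, img_w=512))
```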
Building on this dataset, we develop an explainable and interpretable model called Checkmate that identifies the image cells most relevant to its decisions. Through extensive experiments across multiple model architectures, we show that our approach improves transparency and supports more trustworthy decision-making in RSVQA systems.
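
Because the evaluation hinges on whether the cells a model points to match the annotated evidence, a simple set-overlap score is one way to quantify that agreement. The metric below is an illustrative assumption, not necessarily the paper's definition of localization accuracy.

```python
def cell_localization_score(pred_cells, true_cells):
    """Intersection-over-union between predicted and reference cell sets.

    A set-based proxy for how well a model's highlighted cells match the
    annotated evidence; this particular metric is an assumption, not
    necessarily the one reported in the paper.
    """
    pred, true = set(pred_cells), set(true_cells)
    if not pred and not true:
        return 1.0
    return len(pred & true) / len(pred | true)

# Example: the model highlights three cells, two of which are annotated evidence.
print(cell_localization_score([(0, 1), (1, 1), (2, 3)], [(0, 1), (1, 1)]))  # ~0.667
```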