Redundancy-Aware Pretraining of Vision-Language Foundation Models in Remote Sensing

📅 2025-05-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In remote sensing vision-language model (VLM) pretraining, single-image–multi-caption pairs often exhibit semantic redundancy, leading to high computational overhead and inefficient information utilization. To address this, we propose a Redundancy-Aware Weighted Feature Aggregation (WFA) framework: it introduces a novel dual-weight mechanism that jointly leverages non-parametric uniqueness scoring—based on BLEU—and end-to-end learnable attention, enhancing semantic discriminability while preserving interpretability; and establishes a lightweight pretraining paradigm tailored for remote sensing image-text alignment. Experiments demonstrate substantial improvements in text-to-image retrieval performance (R@1 ↑3.2%) and ~27% reduction in redundant computation. Furthermore, we provide a principled guideline for selecting task- and resource-aware weighting strategies. The code is publicly available.

Technology Category

Application Category

📝 Abstract
The development of foundation models through pretraining of vision-language models (VLMs) has recently attracted great attention in remote sensing (RS). VLM pretraining aims to learn image and language alignments from a large number of image-text pairs. Each pretraining image is often associated with multiple captions containing redundant information due to repeated or semantically similar phrases, resulting in increased pretraining and inference time. To overcome this, we introduce a weighted feature aggregation (WFA) strategy for VLM pretraining in RS. Our strategy aims to extract and exploit complementary information from multiple captions per image while reducing redundancies through feature aggregation with importance weighting. To calculate adaptive importance weights for different captions of each image, we propose two techniques: (i) non-parametric uniqueness and (ii) learning-based attention. In the first technique, importance weights are calculated based on the bilingual evaluation understudy (BLEU) scores of the captions to emphasize unique sentences and reduce the influence of repetitive ones. In the second technique, importance weights are learned through an attention mechanism instead of relying on hand-crafted features. The effectiveness of the proposed WFA strategy with the two techniques is analyzed in terms of downstream performance on text-to-image retrieval in RS. Experimental results show that the proposed strategy enables efficient and effective pretraining of VLMs in RS. Based on the experimental analysis, we derive guidelines for selecting appropriate techniques depending on downstream task requirements and resource constraints. The code of this work is publicly available at https://git.tu-berlin.de/rsim/redundacy-aware-rs-vlm.
Problem

Research questions and friction points this paper is trying to address.

Reduce redundancy in vision-language pretraining for remote sensing
Improve efficiency of image-text alignment learning in VLMs
Optimize caption weighting for better downstream retrieval performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Weighted feature aggregation reduces caption redundancy
Non-parametric uniqueness weights via BLEU scores
Learning-based attention for adaptive caption weighting
🔎 Similar Papers
No similar papers found.
M
Mathis Jurgen Adler
TU Berlin, Germany; BIFOLD, Germany
L
L. Hackel
TU Berlin, Germany; BIFOLD, Germany
Gencer Sumbul
Gencer Sumbul
Ecole Polytechnique Fédérale de Lausanne (EPFL)
remote sensingcomputer visionmachine learningdeep learningweak supervision
B
Begum Demir
TU Berlin, Germany; BIFOLD, Germany