Quality-Driven Curation of Remote Sensing Vision-Language Data via Learned Scoring Models

📅 2025-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Remote sensing vision-language models (VLMs) suffer from a scarcity of high-quality image-text pairs and from the lack of reliable quality assessment mechanisms for synthetic data. To address this, we propose the first learnable data-quality scoring paradigm tailored to remote sensing, using large-scale preference data to train a quality scorer that enables automated ranking and filtering. Our method combines preference learning, fine-tuning of CLIP and Qwen2-VL, reinforcement learning, and Best-of-N inference, overcoming the limitations of rule-based heuristics and CLIP-score baselines. Fine-tuning a VLM on only the top 30% of synthetic samples ranked by the scorer yields significantly higher remote sensing semantic understanding accuracy than full-dataset training or CLIP-score filtering, along with substantial improvements in cross-modal alignment. This work establishes a scalable, learning-based framework for quality-aware synthetic data curation in remote sensing VLMs.
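
The summary describes a scorer trained on large-scale preference data. Below is a minimal sketch of what such pairwise preference training could look like in PyTorch, using a Bradley-Terry style objective; the `QualityScorer` head, the embedding dimension, and the precomputed pair embeddings are illustrative assumptions, since the paper's actual architecture (built on CLIP / Qwen2-VL) is not detailed here.

```python
# Hedged sketch: training a scalar quality scorer from preference pairs.
# All names here (QualityScorer, embed_dim) are illustrative, not the paper's API.
import torch
import torch.nn as nn

class QualityScorer(nn.Module):
    """Maps a fused image-text embedding to a scalar quality score."""
    def __init__(self, embed_dim: int = 1024):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, pair_embedding: torch.Tensor) -> torch.Tensor:
        return self.head(pair_embedding).squeeze(-1)

def preference_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry objective: the preferred sample should score higher.
    return -torch.nn.functional.logsigmoid(score_chosen - score_rejected).mean()

# One training step; embeddings are assumed precomputed by a frozen VLM backbone.
scorer = QualityScorer()
opt = torch.optim.AdamW(scorer.parameters(), lr=1e-4)
chosen = torch.randn(32, 1024)    # stand-in for embeddings of preferred image-text pairs
rejected = torch.randn(32, 1024)  # stand-in for embeddings of rejected pairs
loss = preference_loss(scorer(chosen), scorer(rejected))
opt.zero_grad()
loss.backward()
opt.step()
```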

📝 Abstract
Vision-Language Models (VLMs) have demonstrated great potential in interpreting remote sensing (RS) images through language-guided semantic understanding. However, the effectiveness of these VLMs critically depends on high-quality image-text training data that captures rich semantic relationships between visual content and language descriptions. Unlike the natural-image domain, RS lacks large-scale interleaved image-text pairs from web data, making data collection challenging. While current approaches rely primarily on rule-based methods or flagship VLMs for data synthesis, a systematic framework for automated quality assessment of such synthetically generated RS vision-language data is notably absent. To fill this gap, we propose a novel score model trained on large-scale RS vision-language preference data for automated quality assessment. Our empirical results demonstrate that fine-tuning CLIP or advanced VLMs (e.g., Qwen2-VL) with the top 30% of data ranked by our score model achieves superior interpretation accuracy compared to both full-data fine-tuning and CLIP-score-based ranking approaches. Furthermore, we demonstrate applications of our scoring model for reinforcement learning (RL) training and best-of-N (BoN) test-time scaling, enabling significant improvements in VLM performance for RS tasks.
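
As a rough illustration of the top-30% filtering step described in the abstract, the sketch below ranks precomputed pair embeddings with a trained scorer and keeps the highest-scoring fraction; `filter_top_fraction` and its inputs are hypothetical names, not the paper's API.

```python
# Hedged sketch: rank a synthetic corpus by scorer output and keep the top 30%.
import torch

@torch.no_grad()
def filter_top_fraction(pair_embeddings: torch.Tensor, scorer, fraction: float = 0.30) -> torch.Tensor:
    scores = scorer(pair_embeddings)             # one quality score per image-text pair
    k = max(1, int(fraction * len(scores)))      # size of the retained subset
    return torch.topk(scores, k).indices         # indices of the highest-quality pairs

# kept = filter_top_fraction(all_pair_embeddings, scorer)
# The retained subset would then be used for VLM fine-tuning.
```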
Problem

Research questions and friction points this paper is trying to address.

Lack of high-quality image-text data for remote sensing vision-language models.
Absence of systematic quality assessment for synthetic RS vision-language data.
Need for improved interpretation accuracy in remote sensing tasks using VLMs.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Trained a score model on large-scale RS preference data for automated quality assessment
Fine-tuned VLMs with only the top-ranked 30% of synthetic data
Applied the scoring model to RL training and Best-of-N test-time scaling (see the sketch after this list)
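
A minimal sketch of the Best-of-N idea referenced above: sample N candidate outputs from the VLM, score each image-text pair with the trained scorer, and keep the best. `generate_caption` and `embed_pair` are hypothetical helpers standing in for the paper's unspecified generation and embedding pipeline.

```python
# Hedged sketch of Best-of-N (BoN) test-time scaling with a quality scorer.
import torch

@torch.no_grad()
def best_of_n(image, vlm, scorer, embed_pair, n: int = 8) -> str:
    # Sample N candidate captions from the VLM (hypothetical `generate_caption`).
    candidates = [vlm.generate_caption(image) for _ in range(n)]
    # Score each candidate against the image and return the highest-scoring one.
    scores = torch.stack([scorer(embed_pair(image, c)) for c in candidates])
    return candidates[int(scores.argmax())]
```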