🤖 AI Summary
Remote sensing vision-language models (VLMs) suffer from a scarcity of high-quality image-text pairs and a lack of reliable quality-assessment mechanisms for synthetic data. To address this, we propose the first learnable data-quality scoring paradigm tailored to remote sensing, leveraging large-scale preference data to train a quality scorer that enables automated ranking and filtering. Our method integrates preference learning, joint fine-tuning of CLIP and Qwen2-VL, reinforcement learning, and Best-of-N inference, overcoming the limitations of rule-based heuristics and CLIP-score baselines. When fine-tuning a VLM on only the top 30% highest-scoring synthetic samples, we achieve significantly higher remote sensing semantic-understanding accuracy than either full-dataset training or CLIP-score filtering, alongside substantial improvements in cross-modal alignment. This work establishes a scalable, learning-based framework for quality-aware synthetic-data curation in remote sensing VLMs.
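The scorer is trained on preference data, i.e., pairs where one image-text sample is judged better than another. A common way to turn such pairwise judgments into a training signal is a Bradley-Terry style loss; the sketch below is an illustrative assumption, not the paper's exact objective, and the function names are hypothetical:

```python
import math

def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry style pairwise loss: -log sigmoid(s_chosen - s_rejected).

    The loss shrinks as the scorer assigns a higher value to the preferred
    sample, pushing the model to rank 'chosen' above 'rejected'.
    """
    diff = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

# A correctly ordered pair incurs less loss than a misordered one.
good = pairwise_preference_loss(2.0, 0.0)
bad = pairwise_preference_loss(0.0, 2.0)
```

Minimizing this loss over many preference pairs yields a scalar scorer whose outputs can be used directly for ranking synthetic samples.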
📝 Abstract
Vision-Language Models (VLMs) have demonstrated great potential in interpreting remote sensing (RS) images through language-guided semantic understanding. However, the effectiveness of these VLMs critically depends on high-quality image-text training data that captures rich semantic relationships between visual content and language descriptions. Unlike the natural-image domain, RS lacks large-scale interleaved image-text pairs harvested from web data, making data collection challenging. While current approaches rely primarily on rule-based methods or flagship VLMs for data synthesis, a systematic framework for automated quality assessment of such synthetically generated RS vision-language data is notably absent. To fill this gap, we propose a novel score model trained on large-scale RS vision-language preference data for automated quality assessment. Our empirical results demonstrate that fine-tuning CLIP or advanced VLMs (e.g., Qwen2-VL) with the top 30% of data ranked by our score model achieves superior interpretation accuracy compared to both full-data fine-tuning and CLIP-score-based ranking approaches. Furthermore, we demonstrate applications of our scoring model for reinforcement learning (RL) training and best-of-N (BoN) test-time scaling, enabling significant improvements in VLM performance for RS tasks.
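Once a scalar quality scorer exists, the two downstream uses described above, top-fraction data filtering and best-of-N selection, reduce to simple ranking operations. The following is a minimal sketch under that assumption; the function names and the 30% fraction are illustrative:

```python
from typing import Callable, List, Sequence, Tuple

def filter_top_fraction(samples: Sequence, scores: Sequence[float],
                        fraction: float = 0.3) -> List:
    """Keep the highest-scoring fraction of samples (e.g., top 30%)."""
    ranked: List[Tuple[object, float]] = sorted(
        zip(samples, scores), key=lambda pair: pair[1], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return [sample for sample, _ in ranked[:k]]

def best_of_n(candidates: Sequence, score_fn: Callable[[object], float]):
    """BoN test-time scaling: return the candidate the scorer rates highest."""
    return max(candidates, key=score_fn)

# Example: keep the top 40% of five synthetic samples by score.
kept = filter_top_fraction(["a", "b", "c", "d", "e"],
                           [0.1, 0.9, 0.5, 0.3, 0.7], fraction=0.4)
```

In practice the scorer would be the trained preference model applied to each image-text pair or each sampled VLM response; here any scalar function stands in for it.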