OSM-based Domain Adaptation for Remote Sensing VLMs

📅 2026-03-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the scarcity of high-quality image–text pairs that hinders the development of vision–language models (VLMs) in remote sensing. Existing pseudo-labeling approaches rely on costly large teacher models, limiting scalability and accessibility. To overcome this, the authors propose OSMDA, a novel framework that enables self-adaptation of remote sensing VLMs without human annotations or external strong models. OSMDA leverages the base VLM itself in conjunction with OpenStreetMap renderings, optical character recognition, and geospatial metadata to automatically generate image–text pairs, followed by self-supervised fine-tuning using only satellite imagery. Evaluated across ten remote sensing vision–language tasks, OSMDA outperforms nine established baselines and achieves state-of-the-art performance when combined with real data, all while significantly reducing training costs compared to teacher-dependent methods.

📝 Abstract
Vision-Language Models (VLMs) adapted to remote sensing rely heavily on domain-specific image-text supervision, yet high-quality annotations for satellite and aerial imagery remain scarce and expensive to produce. Prevailing pseudo-labeling pipelines address this gap by distilling knowledge from large frontier models, but this dependence on large teachers is costly, limits scalability, and caps achievable performance at the ceiling of the teacher. We propose OSMDA: a self-contained domain adaptation framework that eliminates this dependency. Our key insight is that a capable base VLM can serve as its own annotation engine: by pairing aerial images with rendered OpenStreetMap (OSM) tiles, we leverage the model's optical character recognition and chart comprehension capabilities to generate captions enriched by OSM's vast auxiliary metadata. The model is then fine-tuned on the resulting corpus with satellite imagery alone, yielding OSMDA-VLM, a domain-adapted VLM that requires no manual labeling and no stronger external model. We conduct exhaustive evaluations spanning 10 image-text-to-text benchmarks and compare against 9 competitive baselines. When equally mixed with real data, our method achieves state-of-the-art results, while being substantially cheaper to train than teacher-dependent alternatives. These results suggest that, given a strong foundation model, alignment with crowd-sourced geographic data is a practical and scalable path towards remote sensing domain adaptation. Dataset and model weights will be made publicly available.
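The pairing step described above hinges on fetching the rendered OSM tile that covers a given satellite image's geographic footprint. The sketch below shows the standard Web Mercator ("slippy map") tile-index computation and a tile URL, plus an illustrative caption-generation prompt; the prompt wording, the `build_caption_prompt` helper, and the exact metadata tags are assumptions for illustration, not the paper's actual pipeline.

```python
import math

def deg2tile(lat_deg: float, lon_deg: float, zoom: int) -> tuple[int, int]:
    """Standard Web Mercator 'slippy map' tile indices for a lat/lon pair."""
    lat_rad = math.radians(lat_deg)
    n = 2 ** zoom
    x = int((lon_deg + 180.0) / 360.0 * n)
    # asinh(tan(lat)) == ln(tan(lat) + sec(lat)), the usual Mercator y term
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

def osm_tile_url(lat_deg: float, lon_deg: float, zoom: int) -> str:
    """URL of the rendered OSM tile covering the given coordinate."""
    x, y = deg2tile(lat_deg, lon_deg, zoom)
    return f"https://tile.openstreetmap.org/{zoom}/{x}/{y}.png"

def build_caption_prompt(place_tags: dict[str, str]) -> str:
    """Hypothetical prompt pairing a satellite image with its OSM tile."""
    tags = ", ".join(f"{k}={v}" for k, v in sorted(place_tags.items()))
    return (
        "You are shown a satellite image and the matching rendered "
        "OpenStreetMap tile. Read any street or place names visible on the "
        "map tile and describe the scene in the satellite image. "
        f"Auxiliary OSM metadata: {tags}."
    )

# Example: central London at zoom 12
print(deg2tile(51.5074, -0.1278, 12))
print(osm_tile_url(51.5074, -0.1278, 12))
print(build_caption_prompt({"highway": "residential", "landuse": "park"}))
```

In the framework's terms, the base VLM would receive both images with a prompt like the one above and emit the caption, which then becomes the text half of a self-generated image-text pair.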
Problem

Research questions and friction points this paper is trying to address.

domain adaptation
remote sensing
vision-language models
pseudo-labeling
data scarcity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Domain Adaptation
Vision-Language Models
OpenStreetMap
Self-supervision
Remote Sensing
Stefan Maria Ailuro
INSAIT, Sofia University "St. Kliment Ohridski"
Mario Markov
INSAIT, Sofia University "St. Kliment Ohridski"
Mohammad Mahdi
INSAIT, Sofia University "St. Kliment Ohridski"
Delyan Boychev
INSAIT, Sofia University "St. Kliment Ohridski"
Luc Van Gool
Professor of computer vision, INSAIT, Sofia University; em. KU Leuven; em. ETH Zurich; Toyota Lab TRACE
computer vision · machine learning · AI · autonomous cars · cultural heritage
Danda Pani Paudel
INSAIT, Sofia University
Computer Vision · Robotics · Earth Observation