AVION: Aerial Vision-Language Instruction from Offline Teacher to Prompt-Tuned Network

📅 2026-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of vision-language models on remote sensing imagery, where insufficient textual semantic coverage and weak visual feature adaptability hinder performance in aerial scenes marked by high intra-class appearance variability and fine-grained distinctions. To overcome these challenges, the authors propose AVION, a knowledge distillation framework tailored for remote sensing: an offline teacher uses a large language model to generate and validate semantically rich textual prototypes, which then guide a lightweight student network to learn adaptive prompts in both the visual and language encoders, enabling efficient cross-modal alignment. The approach introduces only a small number of trainable parameters yet achieves consistent gains across six optical remote sensing benchmarks, improving few-shot classification accuracy, base-class performance, and mean recall in cross-modal retrieval while preserving adaptability and generalization.

📝 Abstract
Adapting vision-language models to remote sensing imagery remains challenging due to two key factors: limited semantic coverage in textual representations and insufficient adaptability of visual features. These issues are particularly pronounced in aerial scenes, which exhibit varied visual appearances and fine-grained object distinctions. We propose AVION, a knowledge distillation framework tailored for remote sensing adaptation of vision-language models. The teacher module constructs semantically rich textual prototypes by collecting descriptions from a large language model and verifying their validity against remote sensing image features. The student module integrates lightweight, learnable prompts into both the vision and language encoders, guided by the teacher to align embeddings and their cross-modal relationships. Once trained, the student operates independently at inference. Experiments on six optical remote sensing benchmarks show that AVION improves few-shot classification and base-class accuracy without degrading generalization to novel categories. It also enhances mean recall for cross-modal retrieval, with minimal additional trainable parameters.
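The teacher-to-student alignment described in the abstract can be pictured as a distillation objective over embeddings. The sketch below is illustrative only, not the paper's implementation: the function name `distill_loss`, the specific mix of a feature-matching term and a contrastive term, and the temperature value are all assumptions standing in for AVION's embedding- and relation-level distillation.

```python
import numpy as np

# Assumed setup (not from the paper's code): teacher prototypes are fixed,
# L2-normalized text embeddings precomputed offline from an LLM plus a
# validation step; the student produces image/text embeddings whose only
# trainable parameters are small prompt vectors in otherwise frozen encoders.

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize rows to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def distill_loss(student_img, student_txt, teacher_txt, temperature=0.07):
    """Align student embeddings with offline teacher prototypes.

    Combines (a) a feature-matching term pulling student text embeddings
    toward teacher prototypes and (b) a cross-modal contrastive term over
    the image-prototype similarity matrix, as a stand-in for the paper's
    relation-level distillation.
    """
    s_img = l2_normalize(student_img)
    s_txt = l2_normalize(student_txt)
    t_txt = l2_normalize(teacher_txt)

    # (a) feature matching: 1 - cosine similarity, averaged over classes
    feat_term = np.mean(1.0 - np.sum(s_txt * t_txt, axis=-1))

    # (b) contrastive alignment: each image should match its own class prototype
    logits = s_img @ t_txt.T / temperature          # shape (N, C)
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    labels = np.arange(len(s_img)) % len(t_txt)     # toy image-to-class pairing
    contrast_term = -np.mean(log_prob[np.arange(len(s_img)), labels])

    return feat_term + contrast_term

# Toy usage: random embeddings; identical teacher/student text embeddings
# make the feature-matching term (near) zero, leaving only the contrastive part.
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
txt = rng.normal(size=(4, 8))
loss = distill_loss(img, txt, txt.copy())
```

In a full pipeline, only the prompt vectors feeding the student encoders would receive gradients from this loss, which is what keeps the trainable parameter count minimal.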
Problem

Research questions and friction points this paper is trying to address.

vision-language models
remote sensing imagery
semantic coverage
visual adaptability
aerial scenes
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language models
knowledge distillation
remote sensing
prompt tuning
cross-modal retrieval
Yu Hu
The University of British Columbia, Okanagan, Kelowna, BC, Canada
Jianyang Gu
The Ohio State University
Imageomics · Dataset Distillation · Data-centric AI
Hao Liu
The University of British Columbia, Okanagan, Kelowna, BC, Canada
Yue Cao
The University of British Columbia, Okanagan, Kelowna, BC, Canada
Jozsef Hamari
TerraSense Analytics, Kelowna, BC, Canada
Zheng Liu
University of British Columbia (Okanagan)
Diagnostics & prognostics · Data/information fusion · Machine/computer vision · Industrial inspection · Digital twin
Mohsen Zardadi
TerraSense Analytics, Kelowna, BC, Canada