Lightweight Multimodal Adaptation of Vision Language Models for Species Recognition and Habitat Context Interpretation in Drone Thermal Imagery

📅 2026-04-07
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This work addresses the representation gap that arises when RGB-pretrained vision-language models are applied directly to thermal infrared drone imagery, proposing a lightweight multimodal adaptation framework. The approach uses multimodal projector alignment to transfer InternVL3 and Qwen-VL family models into the thermal infrared domain, followed by fine-tuning on real-world drone data to enable species identification, individual counting, and habitat semantic understanding. According to the authors, this is the first study to introduce lightweight projection-based adaptation for thermal infrared ecological monitoring, supporting high-accuracy inference under both closed-set and open-set prompting. Experiments show that Qwen3-VL-8B-Instruct performs best in the open-set setting, yielding F1 scores of 0.968, 0.915, and 0.935 for elephants, rhinos, and deer, respectively, with within-1 counting accuracy reaching 1.000 for elephants and effective generation of contextual habitat descriptions.
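The core mechanism, projector-only alignment, keeps the vision encoder and language model frozen and trains only the small projection module that maps visual features into the LLM's token space. A minimal sketch, assuming a Hugging Face Qwen2.5-VL checkpoint and a name-based heuristic for locating the projector weights (both assumptions, not the paper's released code):

```python
# Minimal sketch of projector-only adaptation, assuming a Hugging Face
# Qwen2.5-VL checkpoint and that projector weights are identifiable by name
# (for Qwen2.5-VL the vision-to-LLM projector lives under `visual.merger`;
# verify the name for each model family). Not the authors' exact code.
import torch
from transformers import Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype=torch.bfloat16
)

# Freeze the vision encoder and the LLM; train only the projector.
for name, param in model.named_parameters():
    param.requires_grad = any(key in name for key in ("merger", "projector"))

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)

# Each training step is then a standard causal-LM step on
# (thermal image, caption/label text) batches from the drone dataset:
#   loss = model(**batch).loss
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```

Because only the projector's parameters receive gradients, the optimizer state and compute cost stay small, which is what makes the adaptation "lightweight" relative to full fine-tuning.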
📝 Abstract
This study proposes a lightweight multimodal adaptation framework to bridge the representation gap between RGB-pretrained vision-language models (VLMs) and thermal infrared imagery, and demonstrates its practical utility on a real drone-collected dataset. A thermal dataset was developed from drone-collected imagery and used to fine-tune VLMs through multimodal projector alignment, enabling RGB-based visual representations to transfer to thermal radiometric inputs. Three representative models, InternVL3-8B-Instruct, Qwen2.5-VL-7B-Instruct, and Qwen3-VL-8B-Instruct, were benchmarked under both closed-set and open-set prompting conditions for species recognition and instance enumeration. Among the tested models, Qwen3-VL-8B-Instruct with open-set prompting achieved the best overall performance, with F1 scores of 0.935 for deer, 0.915 for rhino, and 0.968 for elephant, and within-1 enumeration accuracies of 0.779, 0.982, and 1.000, respectively. In addition, combining thermal imagery with simultaneously collected RGB imagery enabled the model to generate habitat-context information, including land-cover characteristics, key landscape features, and visible human disturbance. Overall, the findings demonstrate that lightweight projector-based adaptation provides an effective and practical route for transferring RGB-pretrained VLMs to thermal drone imagery, expanding their utility from object-level recognition to habitat-context interpretation in ecological monitoring.
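To make the two prompting conditions concrete, the sketch below issues a closed-set query (forced choice among the known species) and an open-set query (free-form identification and counting) to a Qwen-VL-style checkpoint, and includes the within-1 counting criterion reported above. The prompt wording and the `ask` helper are illustrative assumptions; the processor and generation calls follow standard Hugging Face Qwen-VL usage.

```python
# Illustrative closed-set vs. open-set prompts; the exact wording used in the
# paper is not given here, so these prompts are assumptions.
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id, device_map="auto")

CLOSED_SET = ("Which species is visible in this thermal drone image? "
              "Answer with exactly one of: elephant, rhino, deer, none.")
OPEN_SET = ("Identify any animal species visible in this thermal drone image "
            "and count the individuals.")

def ask(image: Image.Image, question: str) -> str:
    """Single-image, single-turn query against a Qwen-VL-style checkpoint."""
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": question},
    ]}]
    text = processor.apply_chat_template(messages, tokenize=False,
                                         add_generation_prompt=True)
    inputs = processor(text=[text], images=[image],
                       return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=128)
    answer = out[:, inputs["input_ids"].shape[1]:]  # drop the prompt tokens
    return processor.batch_decode(answer, skip_special_tokens=True)[0]

def within_one(pred_count: int, true_count: int) -> bool:
    """Within-1 enumeration criterion from the abstract: |pred - true| <= 1."""
    return abs(pred_count - true_count) <= 1
```

The closed-set prompt constrains the answer space so recognition reduces to classification, while the open-set prompt leaves species and counts unconstrained, which is the condition under which Qwen3-VL-8B-Instruct achieved its best reported scores.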
Problem

Research questions and friction points this paper is trying to address.

vision language models
thermal infrared imagery
species recognition
habitat context interpretation
drone imagery
Innovation

Methods, ideas, or system contributions that make the work stand out.

lightweight multimodal adaptation
vision language models
thermal infrared imagery
projector alignment
habitat context interpretation
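The habitat-context capability pairs each thermal frame with its simultaneously captured RGB frame in a single query. A hedged sketch of what such a two-image prompt could look like (file names and prompt text are hypothetical, not taken from the paper):

```python
# Hypothetical habitat-context query pairing a thermal frame with its
# simultaneously captured RGB frame (two images in one turn).
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"  # any Qwen-VL-family checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id, device_map="auto")

thermal = Image.open("frame_0041_thermal.png")  # hypothetical file names
rgb = Image.open("frame_0041_rgb.png")

messages = [{"role": "user", "content": [
    {"type": "image"},  # thermal frame
    {"type": "image"},  # co-registered RGB frame
    {"type": "text", "text": (
        "Using both the thermal and RGB views, describe the habitat: "
        "land-cover characteristics, key landscape features, and any "
        "visible human disturbance.")},
]}]
text = processor.apply_chat_template(messages, tokenize=False,
                                     add_generation_prompt=True)
inputs = processor(text=[text], images=[thermal, rgb],
                   return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)[0])
```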
🔎 Similar Papers
2024-03-22 · IEEE Transactions on Circuits and Systems for Video Technology (Print) · Citations: 2
Hao Chen
Geospatial Information Science, The University of Texas at Dallas, Richardson, TX 75080, USA
Fang Qiu
Geospatial Information Science, The University of Texas at Dallas, Richardson, TX 75080, USA
Fangchao Dong
Geospatial Information Science, The University of Texas at Dallas, Richardson, TX 75080, USA
Defei Yang
Geospatial Information Science, The University of Texas at Dallas, Richardson, TX 75080, USA
Eve Bohnett
Department of Landscape Architecture, University of Florida, Gainesville, FL 32611, USA
Li An
Solon & Martha Dixon Endowed Professor, Auburn University
human-environment · landscape ecology · GIScience · complex systems