AI Summary
This study investigates how fusing visible-light and thermal infrared imagery enhances automated detection of wildlife, specifically great blue herons and their nests, addressing the alignment and fusion challenges that arise from inter-modal differences in field of view and spatial resolution. We propose a deep-learning-based cross-modal auto-registration method and comparatively evaluate early fusion (via principal component analysis, PCA) and late fusion (via a classification and regression tree, CART), with YOLO11n as the backbone detector. Results demonstrate that dual-modal fusion outperforms single-modality visible-light detection: late fusion improves the F1 score for the "occupied nest" class from 90.2% to 93.0%, while identifying false positives from either modality with 90% recall. This work empirically validates the discriminative value of thermal infrared cues in complex natural environments and establishes a reproducible technical framework for multimodal remote sensing in biodiversity monitoring.
Abstract
Efficient wildlife monitoring methods are necessary for biodiversity conservation and management. The combination of remote sensing, aerial imagery and deep learning offers promising opportunities to renew or improve existing survey methods. The complementary use of visible (VIS) and thermal infrared (TIR) imagery can add information compared to a single-source image and improve results in an automated detection context. However, the alignment and fusion process can be challenging, especially since visible and thermal images usually have different fields of view (FOV) and spatial resolutions. This research presents a case study on the great blue heron (Ardea herodias) to evaluate the performance of synchronous aerial VIS and TIR imagery for automatically detecting individuals and nests using a YOLO11n model. Two VIS-TIR fusion methods, an early fusion approach and a late fusion approach, were tested and compared to determine whether the addition of the TIR image gives any added value compared to a VIS-only model. VIS and TIR images were automatically aligned using a deep learning model. A principal component analysis fusion method was applied to VIS-TIR image pairs to form the early fusion dataset. A classification and regression tree was used to process the late fusion dataset, based on the detections from the VIS-only and TIR-only trained models. Across all classes, both late and early fusion improved the F1 score compared to the VIS-only model. For the main class, occupied nest, late fusion improved the F1 score from 90.2% (VIS-only) to 93.0%. This model was also able to identify false positives from both sources with 90% recall. Although fusion methods seem to give better results, this approach comes with a limiting TIR FOV and alignment constraints that eliminate data. Using an aircraft-mounted very high-resolution visible sensor could be an interesting option for operationalizing surveys.
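The PCA fusion step for the early fusion dataset can be sketched as follows. This is a minimal illustration of the standard PCA image-fusion scheme, assuming already co-registered, single-channel (grayscale VIS and TIR) inputs; it is not the paper's exact implementation, and the function name is hypothetical:

```python
import numpy as np

def pca_fuse(vis_gray: np.ndarray, tir: np.ndarray) -> np.ndarray:
    """Fuse two co-registered single-channel images by PCA weighting.

    The fusion weights come from the leading eigenvector of the 2x2
    covariance matrix of the two flattened images, so the modality
    with more variance contributes more to the fused result.
    """
    stack = np.stack([vis_gray.ravel(), tir.ravel()]).astype(np.float64)
    cov = np.cov(stack)                     # 2x2 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    w = np.abs(eigvecs[:, -1])              # leading eigenvector
    w = w / w.sum()                         # normalize weights to sum to 1
    fused = w[0] * vis_gray + w[1] * tir
    return fused.reshape(vis_gray.shape)
```

In practice the fused single-channel image would then be combined with (or substituted into) the VIS channels before being passed to the YOLO11n detector.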
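The late fusion step, deciding from the VIS-only and TIR-only model outputs whether a candidate detection is genuine, can be sketched with scikit-learn's CART implementation. The features (per-modality confidence scores) and the toy training data below are illustrative assumptions, not the paper's actual feature set:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # CART algorithm

# Toy training data: one row per candidate detection, columns are
# [VIS confidence, TIR confidence]; label 1 = true detection,
# label 0 = false positive. Values are made up for illustration.
X = np.array([
    [0.95, 0.90], [0.88, 0.75], [0.92, 0.10],
    [0.20, 0.85], [0.15, 0.10], [0.30, 0.25],
])
y = np.array([1, 1, 1, 0, 0, 0])

cart = DecisionTreeClassifier(max_depth=2, random_state=0)
cart.fit(X, y)

# A detection supported confidently by the VIS model is kept.
keep = cart.predict([[0.90, 0.80]])
```

A shallow tree like this is easy to inspect, which matches the abstract's point that the late fusion model can explicitly flag false positives coming from either source.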