🤖 AI Summary
This study evaluates the canopy segmentation performance of five representative models (YOLOv11, Mask R-CNN, DeepLabv3, Swin-UNet, and DINOv2) under extreme data scarcity, using only 150 annotated images from the Solafune Tree Canopy Detection competition. It compares how pretrained convolutional and Transformer-based architectures generalize in low-data remote sensing scenarios. With tailored data augmentation and fine-tuning strategies, the pretrained CNN-based models YOLOv11 and Mask R-CNN generalize significantly better than DeepLabv3, Swin-UNet, and DINOv2. The findings point to two factors in model selection under limited data: the task type (semantic versus instance segmentation) and the architecture's inductive biases. The work offers practical guidance and analysis for ecological remote sensing applications with scarce labeled data.
📝 Abstract
Tree canopy detection from aerial imagery is an important task for environmental monitoring, urban planning, and ecosystem analysis. To simulate real-world annotation scarcity, the Solafune Tree Canopy Detection competition provides a small, imbalanced dataset of only 150 annotated images, which makes it difficult to train deep models without severe overfitting. In this work, we evaluate five representative architectures (YOLOv11, Mask R-CNN, DeepLabv3, Swin-UNet, and DINOv2) to assess their suitability for canopy segmentation under extreme data scarcity. Our experiments show that pretrained convolution-based models, particularly YOLOv11 and Mask R-CNN, generalize significantly better than pretrained transformer-based models. DeepLabv3, Swin-UNet, and DINOv2 underperform, likely because of differences between semantic and instance segmentation tasks, the high data requirements of Vision Transformers, and the lack of strong inductive biases. These findings indicate that transformer-based architectures struggle in low-data regimes without substantial pretraining or augmentation, and that differences between semantic and instance segmentation further affect model performance. We provide a detailed analysis of training strategies, augmentation policies, and model behavior under the small-data constraint, and demonstrate that lightweight CNN-based methods remain the most reliable for canopy detection on limited imagery.
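The gap between semantic and instance segmentation mentioned above can be illustrated with a toy example. The sketch below is not the authors' pipeline. It assumes a hypothetical 4x8 grid with three hand-placed rectangular "crowns", two of which touch. Merging them into a single semantic mask and splitting that mask back into 4-connected components recovers fewer objects than were present, which is one plausible reason a semantic model's output is harder to score on an instance-level canopy task. All names (`box_mask`, `union`, `label_components`) are illustrative.

```python
from collections import deque


def box_mask(rows, cols, top, left, height, width):
    """Binary grid with a filled rectangle standing in for one tree crown."""
    return [
        [1 if top <= r < top + height and left <= c < left + width else 0
         for c in range(cols)]
        for r in range(rows)
    ]


def union(masks):
    """Collapse per-instance masks into one semantic (canopy / background) mask."""
    rows, cols = len(masks[0]), len(masks[0][0])
    return [[int(any(m[r][c] for m in masks)) for c in range(cols)]
            for r in range(rows)]


def label_components(mask):
    """Count 4-connected foreground regions with a breadth-first flood fill."""
    rows, cols = len(mask), len(mask[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and labels[r][c] == 0:
                count += 1
                labels[r][c] = count
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and labels[ny][nx] == 0):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count


# Hypothetical scene: crowns A and B touch, crown C stands alone.
instances = [
    box_mask(4, 8, top=0, left=0, height=2, width=2),  # crown A
    box_mask(4, 8, top=0, left=2, height=2, width=2),  # crown B (adjacent to A)
    box_mask(4, 8, top=2, left=6, height=2, width=2),  # crown C
]
semantic = union(instances)
_, recovered = label_components(semantic)

print("ground-truth instances:", len(instances))
print("instances recovered from semantic mask:", recovered)
```

In this toy scene the touching crowns A and B collapse into a single region, so the semantic mask yields fewer objects than the ground truth contains. Instance-level models such as Mask R-CNN predict one mask per object and so avoid this post-processing step.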