🤖 AI Summary
Individual re-identification of wild western lowland gorillas relies heavily on manual annotation and suffers from a critical lack of large-scale, in-the-wild video datasets.
Method: We propose an end-to-end automated monitoring system featuring (i) a multi-frame self-supervised pretraining strategy leveraging trajectory consistency to learn domain-specific representations; (ii) differentiable AttnLRP for visualizing and validating model attention on biologically relevant features rather than background artifacts; and (iii) a spatiotemporally constrained clustering algorithm to mitigate over-segmentation and enhance robustness in unsupervised population counting.
Contribution/Results: We release the largest benchmark of wild primate re-identification video to date, comprising three newly collected datasets. Experiments demonstrate that aggregating features from image-based backbone networks outperforms dedicated video architectures. Our system significantly improves individual identification accuracy and population tracking efficiency in real-world field conditions.
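The spatiotemporally constrained clustering mentioned above can be illustrated with a minimal sketch: two tracklets recorded at the same time by different cameras cannot belong to the same individual, so such merges are forbidden during agglomeration, which counteracts over-segmentation without allowing impossible merges. The tracklet record format, function names, and the greedy merge loop here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def cannot_link(a, b):
    # Tracklets overlapping in time at different cameras cannot be one individual
    # (illustrative constraint; records are hypothetical dicts, not the paper's format).
    return a["cam"] != b["cam"] and a["t0"] <= b["t1"] and b["t0"] <= a["t1"]

def constrained_cluster(tracklets, thresh=0.8):
    """Greedy agglomerative merging with spatiotemporal cannot-link constraints."""
    clusters = [[t] for t in tracklets]
    merged = True
    while merged:
        merged = False
        best, pair = thresh, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Skip any merge that would violate a cannot-link constraint.
                if any(cannot_link(a, b) for a in clusters[i] for b in clusters[j]):
                    continue
                # Cosine similarity between mean cluster embeddings.
                ei = np.mean([t["emb"] for t in clusters[i]], axis=0)
                ej = np.mean([t["emb"] for t in clusters[j]], axis=0)
                sim = ei @ ej / (np.linalg.norm(ei) * np.linalg.norm(ej))
                if sim > best:
                    best, pair = sim, (i, j)
        if pair is not None:
            i, j = pair
            clusters[i] += clusters.pop(j)
            merged = True
    return clusters
```

Without the `cannot_link` check, two visually similar but co-occurring tracklets would be merged and the population count would be underestimated; with it, visual similarity alone is not sufficient evidence of identity.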
📝 Abstract
Monitoring critically endangered western lowland gorillas is currently hampered by the immense manual effort required to re-identify individuals from vast archives of camera trap footage. The primary obstacle to automating this process has been the lack of large-scale, "in-the-wild" video datasets suitable for training robust deep learning models. To address this gap, we introduce a comprehensive benchmark with three novel datasets: Gorilla-SPAC-Wild, the largest video dataset for wild primate re-identification to date; Gorilla-Berlin-Zoo, for assessing cross-domain re-identification generalization; and Gorilla-SPAC-MoT, for evaluating multi-object tracking in camera trap footage. Building on these datasets, we present GorillaWatch, an end-to-end pipeline integrating detection, tracking, and re-identification. To exploit temporal information, we introduce a multi-frame self-supervised pretraining strategy that leverages consistency within tracklets to learn domain-specific features without manual labels. To ensure scientific validity, a differentiable adaptation of AttnLRP verifies that our model relies on discriminative biometric traits rather than background correlations. Extensive benchmarking demonstrates that aggregating features from large-scale image backbones outperforms specialized video architectures. Finally, we address unsupervised population counting by integrating spatiotemporal constraints into standard clustering to mitigate over-segmentation. We publicly release all code and datasets to facilitate scalable, non-invasive monitoring of endangered species.
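The tracklet-consistency pretraining described in the abstract can be sketched as a contrastive (InfoNCE-style) objective: frames drawn from the same tracklet are treated as positives, all other frames in the batch as negatives, so no manual identity labels are needed. This is a hedged, NumPy-only illustration of the general technique; the function name, temperature value, and loss details are assumptions, not the paper's exact objective.

```python
import numpy as np

def tracklet_info_nce(embs, tracklet_ids, temp=0.1):
    """InfoNCE-style loss where frames of the same tracklet are positives.

    embs: (N, D) frame embeddings; tracklet_ids: length-N tracklet labels.
    """
    embs = np.asarray(embs, dtype=float)
    tracklet_ids = np.asarray(tracklet_ids)
    # L2-normalize so the dot product is cosine similarity.
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sim = embs @ embs.T / temp
    np.fill_diagonal(sim, -np.inf)  # exclude self-similarity (exp(-inf) = 0)
    log_den = np.log(np.exp(sim).sum(axis=1))
    losses = []
    for i in range(len(embs)):
        pos = tracklet_ids == tracklet_ids[i]
        pos[i] = False  # a frame is not its own positive
        if pos.any():
            # -log( exp(sim_pos) / sum_j exp(sim_ij) ), averaged over positives
            losses.append(np.mean(log_den[i] - sim[i, pos]))
    return float(np.mean(losses))
```

Minimizing this loss pulls frames of the same tracklet together in embedding space and pushes different tracklets apart, which is one plausible way to realize the trajectory-consistency signal the paper describes.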