🤖 AI Summary
Current wildlife multi-animal tracking (MAT) datasets suffer from limited scale, low species diversity, and insufficient spatiotemporal coverage, hindering the development of generalizable models. To address this, we introduce SA-FARI, the largest open-source wildlife MAT benchmark, comprising roughly 46 hours of densely annotated camera-trap video spanning 99 species categories, with 942,702 bounding box annotations and 16,224 masklet identities; anonymized camera-trap locations are published alongside each video to preserve privacy while enabling cross-regional study. This work uniquely unifies high species diversity, broad geographic coverage, and high-fidelity spatiotemporal annotation. Leveraging SA-FARI, we conduct a systematic evaluation of state-of-the-art vision-language models (e.g., SAM 3) and vision-only wildlife-specific methods on detection and tracking. Our benchmark establishes a reproducible foundation for behavioral analysis and population monitoring in ecological conservation.
📝 Abstract
Automated video analysis is critical for wildlife conservation. A foundational task in this domain is multi-animal tracking (MAT), which underpins applications such as individual re-identification and behavior recognition. However, existing datasets are limited in scale, constrained to a few species, or lack sufficient temporal and geographical diversity, leaving no suitable benchmark for training general-purpose MAT models applicable across wild animal populations. To address this, we introduce SA-FARI, the largest open-source MAT dataset for wild animals. It comprises 11,609 camera-trap videos collected over approximately 10 years (2014–2024) from 741 locations across 4 continents, spanning 99 species categories. Each video is exhaustively annotated, culminating in ~46 hours of densely annotated footage containing 16,224 masklet identities and 942,702 individual bounding boxes, segmentation masks, and species labels. Alongside the task-specific annotations, we publish anonymized camera-trap locations for each video. Finally, we present comprehensive benchmarks on SA-FARI using state-of-the-art vision-language models for detection and tracking, including SAM 3, evaluated with both species-specific and generic animal prompts. We also compare against vision-only methods developed specifically for wildlife analysis. SA-FARI is the first large-scale dataset to combine high species diversity, multi-region coverage, and high-quality spatio-temporal annotations, offering a new foundation for advancing generalizable multi-animal tracking in the wild. The dataset is available at https://www.conservationxlabs.com/sa-fari.