🤖 AI Summary
Current animal video datasets are limited in scale (as few as 2.4K clips of 15 frames each) and lack the processing needed for animal-centric 3D/4D tasks. To address these limitations, we propose an automated pipeline for mining and processing in-the-wild animal videos: it harvests videos from YouTube, processes them into object-centric clips, and adds auxiliary annotations for downstream tasks such as pose estimation, tracking, and 3D/4D reconstruction. The pipeline yields 30K videos (2M frames), an order of magnitude more than prior work. From this data we build Animal-in-Motion (AiM), a benchmark for 4D quadruped reconstruction comprising 230 manually filtered sequences (11K frames) of clean, diverse animal motion. Evaluation on the benchmark exposes a gap between conventional 2D metrics and 3D plausibility: model-based methods score higher on 2D metrics despite unrealistic 3D shapes, while model-free methods yield more natural reconstructions but score lower. Finally, we enhance a recent model-free approach with sequence-level optimization, establishing the first 4D animal reconstruction baseline for markerless reconstruction from in-the-wild videos.
📝 Abstract
Computer vision for animals holds great promise for wildlife research but often depends on large-scale data, while existing collection methods rely on controlled capture setups. Recent data-driven approaches show the potential of single-view, non-invasive analysis, yet current animal video datasets are limited--offering as few as 2.4K 15-frame clips and lacking key processing for animal-centric 3D/4D tasks. We introduce an automated pipeline that mines YouTube videos and processes them into object-centric clips, along with auxiliary annotations valuable for downstream tasks like pose estimation, tracking, and 3D/4D reconstruction. Using this pipeline, we amass 30K videos (2M frames)--an order of magnitude more than prior works. To demonstrate its utility, we focus on the 4D quadruped animal reconstruction task. To support this task, we present Animal-in-Motion (AiM), a benchmark of 230 manually filtered sequences with 11K frames showcasing clean, diverse animal motions. We evaluate state-of-the-art model-based and model-free methods on Animal-in-Motion, finding that 2D metrics favor the former despite unrealistic 3D shapes, while the latter yields more natural reconstructions but scores lower--revealing a gap in current evaluation. To address this, we enhance a recent model-free approach with sequence-level optimization, establishing the first 4D animal reconstruction baseline. Together, our pipeline, benchmark, and baseline aim to advance large-scale, markerless 4D animal reconstruction and related tasks from in-the-wild videos. Code and datasets are available at https://github.com/briannlongzhao/Animal-in-Motion.