AI Summary
This work addresses two key challenges in cross-species animal pose and shape reconstruction: insufficient model capacity and the scarcity of high-quality multi-species 3D data. To this end, we propose the first unified visual framework for jointly modeling mammals and birds. Methodologically, we design a family-aware Vision Transformer incorporating a Mixture-of-Experts architecture to explicitly disentangle species-specific and shared representations. We further introduce CtrlAVES3D, the first large-scale 3D-annotated avian dataset, and leverage diffusion-based conditional synthesis to augment training with photorealistic synthetic images, enabling synergistic real-synthetic co-training. Trained on 41.3k mammalian and 12.4k avian images, our model achieves state-of-the-art performance on cross-domain benchmarks including Animal Kingdom, significantly outperforming prior approaches. The framework delivers an accurate, scalable, and general-purpose spatial understanding tool for quantitative biological analysis.
Abstract
In the era of foundation models, achieving a unified understanding of different dynamic objects through a single network has the potential to enable stronger spatial intelligence. Moreover, accurate estimation of animal pose and shape across diverse species is essential for quantitative analysis in biological research. However, this topic remains underexplored due to the limited network capacity of previous methods and the scarcity of comprehensive multi-species datasets. To address these limitations, we introduce AniMer+, an extended version of our scalable AniMer framework. In this paper, we focus on a unified approach for reconstructing mammals (mammalia) and birds (aves). A key innovation of AniMer+ is its high-capacity, family-aware Vision Transformer (ViT) incorporating a Mixture-of-Experts (MoE) design. Its architecture partitions network layers into taxa-specific components (for mammalia and aves) and taxa-shared components, enabling efficient learning of both distinct and common anatomical features within a single model. To overcome the critical shortage of 3D training data, especially for birds, we introduce a diffusion-based conditional image generation pipeline. This pipeline produces two large-scale synthetic datasets: CtrlAni3D for quadrupeds and CtrlAVES3D for birds. Notably, CtrlAVES3D is the first large-scale, 3D-annotated dataset for birds, which is crucial for resolving single-view depth ambiguities. Trained on an aggregated collection of 41.3k mammalian and 12.4k avian images (combining real and synthetic data), our method demonstrates superior performance over existing approaches across a wide range of benchmarks, including the challenging out-of-domain Animal Kingdom dataset. Ablation studies confirm the effectiveness of both our novel network architecture and the generated synthetic datasets in enhancing real-world performance.
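The family-aware MoE idea described above (taxa-specific experts alongside taxa-shared layers) can be illustrated with a minimal sketch. This is not the authors' implementation; the layer names, dimensions, and the additive mixing of shared and expert paths are illustrative assumptions, using plain numpy in place of a trained ViT block.

```python
import numpy as np

# Hedged sketch of a "family-aware" layer: features pass through a
# taxa-shared component plus a taxa-specific expert selected by the
# input's family label (mammalia vs. aves). All weights here are random
# stand-ins for learned parameters.

rng = np.random.default_rng(0)

def random_weight(d_in, d_out):
    """Random matrix standing in for a learned linear projection."""
    return rng.standard_normal((d_in, d_out)) * 0.02

class FamilyAwareLayer:
    def __init__(self, dim, families=("mammalia", "aves")):
        self.shared = random_weight(dim, dim)                       # taxa-shared path
        self.experts = {f: random_weight(dim, dim) for f in families}  # taxa-specific paths

    def __call__(self, x, family):
        # Shared path models common anatomy; the routed expert models
        # family-specific structure. Their sum is the mixed output.
        return x @ self.shared + x @ self.experts[family]

layer = FamilyAwareLayer(dim=8)
tokens = rng.standard_normal((4, 8))   # 4 ViT-style tokens, feature dim 8

out_mammal = layer(tokens, "mammalia")
out_bird = layer(tokens, "aves")

# Same input routed through different taxa-specific experts yields
# different features, while the shared component is reused by both.
print(out_mammal.shape, np.allclose(out_mammal, out_bird))
```

In a real ViT-MoE, the expert would be a full feed-forward sub-block inside selected transformer layers, and routing would come from the known taxonomic family of each training image rather than a learned gate.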