AI Summary
This work addresses two key challenges in cross-species animal pose and shape reconstruction: insufficient model capacity and the scarcity of high-quality multi-species 3D data. To this end, we propose the first unified visual framework for jointly modeling mammals and birds. Methodologically, we design a family-aware Vision Transformer incorporating a Mixture-of-Experts architecture to explicitly disentangle species-specific and shared representations. We further introduce CtrlAVES3D, the first large-scale 3D-annotated avian dataset, and leverage diffusion-based conditional synthesis to augment training with photorealistic synthetic images, enabling synergistic real-synthetic co-training. Trained on 41.3k mammalian and 12.4k avian images, our model achieves state-of-the-art performance on cross-domain benchmarks including Animal Kingdom, significantly outperforming prior approaches. The framework delivers an accurate, scalable, and general-purpose spatial understanding tool for quantitative biological analysis.
Abstract
In the era of foundation models, achieving a unified understanding of different dynamic objects through a single network has the potential to enable stronger spatial intelligence. Moreover, accurate estimation of animal pose and shape across diverse species is essential for quantitative analysis in biological research. However, this topic remains underexplored due to the limited network capacity of previous methods and the scarcity of comprehensive multi-species datasets. To address these limitations, we introduce AniMer+, an extended version of our scalable AniMer framework. In this paper, we focus on a unified approach for reconstructing mammals (mammalia) and birds (aves). A key innovation of AniMer+ is its high-capacity, family-aware Vision Transformer (ViT) incorporating a Mixture-of-Experts (MoE) design. Its architecture partitions network layers into taxa-specific components (for mammalia and aves) and taxa-shared components, enabling efficient learning of both distinct and common anatomical features within a single model. To overcome the critical shortage of 3D training data, especially for birds, we introduce a diffusion-based conditional image generation pipeline. This pipeline produces two large-scale synthetic datasets: CtrlAni3D for quadrupeds and CtrlAVES3D for birds. Notably, CtrlAVES3D is the first large-scale, 3D-annotated dataset for birds, which is crucial for resolving single-view depth ambiguities. Trained on an aggregated collection of 41.3k mammalian and 12.4k avian images (combining real and synthetic data), our method demonstrates superior performance over existing approaches across a wide range of benchmarks, including the challenging out-of-domain Animal Kingdom dataset. Ablation studies confirm the effectiveness of both our novel network architecture and the generated synthetic datasets in enhancing real-world performance.
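The family-aware MoE idea described above (taxa-specific experts alongside taxa-shared layers) can be illustrated with a minimal sketch. This is not the authors' implementation; the layer names, dimensions, and the additive mixing of shared and expert paths are illustrative assumptions, using plain numpy in place of a trained ViT block.

```python
import numpy as np

# Hedged sketch of a "family-aware" layer: features pass through a
# taxa-shared component plus a taxa-specific expert selected by the
# input's family label (mammalia vs. aves). All weights here are random
# stand-ins for learned parameters.

rng = np.random.default_rng(0)

def random_weight(d_in, d_out):
    """Random matrix standing in for a learned linear projection."""
    return rng.standard_normal((d_in, d_out)) * 0.02

class FamilyAwareLayer:
    def __init__(self, dim, families=("mammalia", "aves")):
        self.shared = random_weight(dim, dim)                       # taxa-shared path
        self.experts = {f: random_weight(dim, dim) for f in families}  # taxa-specific paths

    def __call__(self, x, family):
        # Shared path models common anatomy; the routed expert models
        # family-specific structure. Their sum is the mixed output.
        return x @ self.shared + x @ self.experts[family]

layer = FamilyAwareLayer(dim=8)
tokens = rng.standard_normal((4, 8))   # 4 ViT-style tokens, feature dim 8

out_mammal = layer(tokens, "mammalia")
out_bird = layer(tokens, "aves")

# Same input routed through different taxa-specific experts yields
# different features, while the shared component is reused by both.
print(out_mammal.shape, np.allclose(out_mammal, out_bird))
```

In a real ViT-MoE, the expert would be a full feed-forward sub-block inside selected transformer layers, and routing would come from the known taxonomic family of each training image rather than a learned gate.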