AniMer+: Unified Pose and Shape Estimation Across Mammalia and Aves via Family-Aware Transformer

๐Ÿ“… 2025-07-31
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work addresses two key challenges in cross-species animal pose and shape reconstruction: insufficient model capacity and scarcity of high-quality multi-species 3D data. To this end, we propose the first unified visual framework for jointly modeling mammals and birds. Methodologically, we design a family-aware Vision Transformer incorporating a Mixture-of-Experts architecture to explicitly disentangle species-specific and shared representations. We further introduce CtrlAVES3D, the first large-scale 3D-annotated avian dataset, and leverage diffusion-based conditional synthesis to augment training with photorealistic synthetic images, enabling synergistic real-synthetic co-training. Trained on 41.3k mammalian and 12.4k avian images, our model achieves state-of-the-art performance on cross-domain benchmarks including Animal Kingdom, significantly outperforming prior approaches. The framework delivers a high-accuracy, scalable, and general-purpose spatial understanding tool for quantitative biological analysis.

๐Ÿ“ Abstract
In the era of foundation models, achieving a unified understanding of different dynamic objects through a single network has the potential to empower stronger spatial intelligence. Moreover, accurate estimation of animal pose and shape across diverse species is essential for quantitative analysis in biological research. However, this topic remains underexplored due to the limited network capacity of previous methods and the scarcity of comprehensive multi-species datasets. To address these limitations, we introduce AniMer+, an extended version of our scalable AniMer framework. In this paper, we focus on a unified approach for reconstructing mammals (mammalia) and birds (aves). A key innovation of AniMer+ is its high-capacity, family-aware Vision Transformer (ViT) incorporating a Mixture-of-Experts (MoE) design. Its architecture partitions network layers into taxa-specific components (for mammalia and aves) and taxa-shared components, enabling efficient learning of both distinct and common anatomical features within a single model. To overcome the critical shortage of 3D training data, especially for birds, we introduce a diffusion-based conditional image generation pipeline. This pipeline produces two large-scale synthetic datasets: CtrlAni3D for quadrupeds and CtrlAVES3D for birds. Notably, CtrlAVES3D is the first large-scale, 3D-annotated dataset for birds, which is crucial for resolving single-view depth ambiguities. Trained on an aggregated collection of 41.3k mammalian and 12.4k avian images (combining real and synthetic data), our method demonstrates superior performance over existing approaches across a wide range of benchmarks, including the challenging out-of-domain Animal Kingdom dataset. Ablation studies confirm the effectiveness of both our novel network architecture and the generated synthetic datasets in enhancing real-world application performance.
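The taxa-specific/taxa-shared split described above can be sketched as a minimal "family-aware" feed-forward layer: a shared expert processes every token, while a per-family expert (mammalia or aves) is selected by a family label, and the two paths are fused. All class and variable names, the ReLU-MLP experts, and the averaging fusion rule are illustrative assumptions for this sketch, not the authors' exact architecture.

```python
import numpy as np

FAMILIES = ("mammalia", "aves")

class FamilyAwareFFN:
    """Hypothetical sketch of a family-aware MoE feed-forward block:
    one taxa-shared expert plus one expert per family, routed by a
    family label (illustrative, not the paper's exact design)."""

    def __init__(self, dim, hidden, seed=0):
        rng = np.random.default_rng(seed)

        def linear_pair():
            # Weights of a small two-layer MLP expert.
            return (rng.normal(0, 0.02, (dim, hidden)),
                    rng.normal(0, 0.02, (hidden, dim)))

        self.shared = linear_pair()                        # taxa-shared expert
        self.experts = {f: linear_pair() for f in FAMILIES}  # taxa-specific

    def _ffn(self, x, weights):
        w1, w2 = weights
        return np.maximum(x @ w1, 0.0) @ w2                # ReLU MLP

    def __call__(self, x, family):
        # Route tokens through both the shared expert and the expert
        # for this family, then average the two paths (one simple
        # assumed way to fuse shared and specific features).
        return 0.5 * (self._ffn(x, self.shared)
                      + self._ffn(x, self.experts[family]))

layer = FamilyAwareFFN(dim=8, hidden=16)
tokens = np.ones((4, 8))            # 4 tokens with 8-dim features
out_mammal = layer(tokens, "mammalia")
out_bird = layer(tokens, "aves")
print(out_mammal.shape)             # (4, 8): output keeps token shape
```

In a full transformer, such a block would replace the standard FFN in some layers, letting the backbone learn anatomy common to both taxa in the shared path while specializing where mammal and bird body plans diverge.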
Problem

Research questions and friction points this paper is trying to address.

Unified pose and shape estimation for mammals and birds
Addressing limited network capacity and dataset scarcity
Enhancing accuracy with synthetic data and novel architecture
Innovation

Methods, ideas, or system contributions that make the work stand out.

Family-aware ViT with MoE design
Diffusion-based synthetic data generation
First large-scale 3D bird dataset
๐Ÿ”Ž Similar Papers
No similar papers found.
Jin Lyu
Department of Electronic and Electrical Engineering, Southern University of Science and Technology, Shenzhen, China
Liang An
Tsinghua University
3D vision, human motion capture, animal motion capture
Li Lin
Department of Electronic and Electrical Engineering, Southern University of Science and Technology, Shenzhen, China, also with Jiaxing Research Institute, Southern University of Science and Technology, Jiaxing, China, and also with Department of Electrical and Electronic Engineering, the University of Hong Kong, Hong Kong, China
Pujin Cheng
Department of Electronic and Electrical Engineering, Southern University of Science and Technology, Shenzhen, China, also with Jiaxing Research Institute, Southern University of Science and Technology, Jiaxing, China, and also with Department of Electrical and Electronic Engineering, the University of Hong Kong, Hong Kong, China
Yebin Liu
Professor, Tsinghua University
Computer Graphics, Computational Photography, 3D Vision, Digital Humans
Xiaoying Tang
Department of Electronic and Electrical Engineering, Southern University of Science and Technology, Shenzhen, China, and also with Jiaxing Research Institute, Southern University of Science and Technology, Jiaxing, China