🤖 AI Summary
To address the scarcity of high-quality annotated data for 2D animal pose estimation, we propose AP-CAP, a controllable multimodal image generation pipeline. Methodologically, we introduce the Modality-Pose-Caption Hybrid (MPCH) dataset, the first hybrid dataset to combine synthetic and real data, integrating visual modalities, pose annotations, and descriptive text; design three synthesis strategies: modality fusion, pose adjustment, and caption enhancement; and incorporate cross-modal feature alignment with controllable diffusion-based generation. Our contributions include: (1) constructing MPCH, the largest multi-source heterogeneous animal pose benchmark to date; (2) enabling on-demand synthesis of high-fidelity images spanning diverse poses and species; and (3) improving downstream pose estimators across multiple animal categories, with a reported +12.6% average precision (AP) gain in keypoint detection and better cross-species generalization.
📝 Abstract
2D animal pose estimation plays a crucial role in advancing deep learning applications in animal behavior analysis and ecological research. Despite notable progress in existing approaches, our study reveals that the scarcity of high-quality datasets remains a significant bottleneck that limits the potential of current methods. To address this challenge, we propose a novel Controllable Image Generation Pipeline for synthesizing animal pose estimation data, termed AP-CAP. Within this pipeline, we introduce a Multi-Modal Animal Image Generation Model capable of producing images with specified poses. To enhance the quality and diversity of the generated data, we further propose three strategies: (1) a Modality-Fusion-Based Animal Image Synthesis Strategy that integrates multi-source appearance representations, (2) a Pose-Adjustment-Based Animal Image Synthesis Strategy that dynamically captures diverse pose variations, and (3) a Caption-Enhancement-Based Animal Image Synthesis Strategy that enriches visual semantic understanding. Leveraging the proposed model and strategies, we create the MPCH (Modality-Pose-Caption Hybrid) Dataset, the first hybrid dataset to combine synthetic and real data, establishing the largest multi-source heterogeneous benchmark for animal pose estimation to date. Extensive experiments demonstrate that our method improves both the performance and the generalization capability of animal pose estimators.