🤖 AI Summary
Multi-species animal pose estimation faces significant generalization bottlenecks due to large inter-species visual discrepancies, long-tailed training data distributions, and cross-modal misalignment. To address these challenges, we propose a probabilistic prompt modeling framework that introduces, for the first time, a text-semantics-driven probabilistic sampling mechanism coupled with a diversity-constrained loss. Our method integrates learnable prompt optimization with three spatial-level cross-modal fusion strategies, substantially enhancing pose generalization to unseen species. Leveraging pre-trained vision-language models (e.g., CLIP), it avoids fine-tuning the visual backbone. On a multi-species animal pose benchmark, our approach achieves state-of-the-art performance under both supervised learning and zero-shot transfer settings. It is the first work to systematically resolve prompt robustness and cross-modal synergy issues in cross-species pose estimation.
📝 Abstract
Multi-species animal pose estimation has emerged as a challenging yet critical task, hindered by substantial visual diversity and uncertainty. This paper tackles the problem through efficient prompt learning for Vision-Language Pretrained (VLP) models, e.g. CLIP, aiming to resolve the cross-species generalization problem. The core of our solution lies in prompt design, probabilistic prompt modeling, and cross-modal adaptation, which together enable prompts to compensate for cross-modal information and effectively cope with large data variance under unbalanced data distributions. To this end, we propose a novel probabilistic prompting approach that fully exploits textual descriptions, alleviating the diversity issues caused by the long-tailed data property and increasing the adaptability of prompts to unseen-category instances. Specifically, we first introduce a set of learnable prompts and propose a diversity loss that maintains distinctiveness among prompts, so that they represent diverse image attributes. Diverse textual probabilistic representations are then sampled and used as guidance for pose estimation. Subsequently, we explore three cross-modal fusion strategies at the spatial level to alleviate the adverse impact of visual uncertainty. Extensive experiments on multi-species animal pose benchmarks show that our method achieves state-of-the-art performance under both supervised and zero-shot settings. The code is available at https://github.com/Raojiyong/PPAP.
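To make the two prompt-side ingredients concrete, the following is a minimal NumPy sketch of (a) a diversity loss that penalizes pairwise cosine similarity among a set of learnable prompt embeddings, and (b) reparameterized Gaussian sampling of textual representations from prompt-conditioned means and variances. This is an illustrative assumption about the paper's design (the exact loss form and sampler are not specified in the abstract), not the released implementation; function names such as `diversity_loss` and `sample_text_representation` are hypothetical.

```python
import numpy as np

def diversity_loss(prompts: np.ndarray) -> float:
    """Mean off-diagonal cosine similarity among K prompt embeddings.

    prompts: array of shape (K, D) holding learnable prompt vectors.
    Lower is better: identical prompts score 1.0, orthogonal prompts 0.0.
    (Hypothetical form of the paper's diversity constraint.)
    """
    normed = prompts / np.linalg.norm(prompts, axis=1, keepdims=True)
    sim = normed @ normed.T                      # (K, K) cosine similarities
    k = prompts.shape[0]
    off_diag = sim[~np.eye(k, dtype=bool)]       # drop self-similarities
    return float(off_diag.mean())

def sample_text_representation(mu: np.ndarray, log_var: np.ndarray,
                               rng: np.random.Generator) -> np.ndarray:
    """Reparameterized Gaussian sample z = mu + sigma * eps.

    mu, log_var: (K, D) per-prompt mean and log-variance of the
    textual distribution; one draw yields diverse guidance vectors.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps
```

In training, the diversity term would be added to the pose loss so the K prompts spread out in embedding space, while fresh samples from `sample_text_representation` supply varied textual guidance at each step, which is what gives the prompts robustness on rare and unseen species.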