AI Summary
Static joint embeddings in category-agnostic pose estimation face two key challenges: cross-category semantic ambiguity (e.g., "leg" looks drastically different on humans than on furniture) and insufficient fine-grained instance discrimination (e.g., joint representations are confounded across cats with different poses or fur colors). To address these, we propose a dynamic support-information modeling framework that draws on textual semantic priors together with category-level and instance-level visual cues from images. Hierarchical cross-modal interaction and a dual-stream feature refinement mechanism dynamically enrich the joint embeddings, and the dual-stream design supports cross-modal alignment and interaction at multiple granularities. On the MP-100 benchmark, the method significantly outperforms prior methods, establishing new state-of-the-art performance in both generalization capability and fine-grained pose discrimination.
Abstract
Recent research in Category-Agnostic Pose Estimation (CAPE) has adopted fixed textual keypoint descriptions as semantic priors for two-stage pose matching frameworks. While this paradigm improves robustness and flexibility by removing the dependency on support images, our analysis reveals two inherent limitations of static joint embeddings: (1) polysemy-induced cross-category ambiguity during the matching process (e.g., the concept "leg" has divergent visual manifestations across humans and furniture), and (2) insufficient discriminability for fine-grained intra-category variations (e.g., posture and fur differences between a sleeping white cat and a standing black cat). To overcome these challenges, we propose CapeNext, a new framework that integrates hierarchical cross-modal interaction with dual-stream feature refinement, enhancing the joint embedding with both class-level and instance-specific cues from the textual description and the specific images. Experiments on the MP-100 dataset demonstrate that, regardless of the network backbone, CapeNext consistently outperforms state-of-the-art CAPE methods by a large margin.
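To make the idea of a dynamic joint embedding concrete, here is a toy sketch in plain Python. It is not the CapeNext implementation, since the abstract does not specify the architecture. The function names (`attend`, `refine_joint_embedding`), the single-head attention, the residual-sum fusion of the two streams, and the 4-dimensional toy vectors are all assumptions made purely for illustration. The sketch shows how one static embedding for "leg" can be turned into different embeddings depending on class-level and instance-level cues.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, cues):
    # Single-head scaled dot-product attention of one query vector
    # over a list of cue vectors; returns the weighted sum of the cues.
    scale = math.sqrt(len(query))
    scores = [sum(q * c for q, c in zip(query, cue)) / scale for cue in cues]
    weights = softmax(scores)
    return [sum(w * cue[i] for w, cue in zip(weights, cues))
            for i in range(len(query))]

def refine_joint_embedding(static_emb, class_cues, instance_cues):
    # Hypothetical two-stream refinement (an assumption, not the paper's design):
    # one stream gathers class-level context, the other instance-level context,
    # and both are added to the static embedding as residuals.
    class_ctx = attend(static_emb, class_cues)
    inst_ctx = attend(static_emb, instance_cues)
    return [s + c + i for s, c, i in zip(static_emb, class_ctx, inst_ctx)]

# Toy data: one static text embedding for the keypoint "leg".
leg = [1.0, 0.0, 0.0, 0.0]

# Made-up class-level and instance-level cue vectors for two categories.
human_class = [[0.0, 1.0, 0.0, 0.0]]
chair_class = [[0.0, 0.0, 1.0, 0.0]]
human_instance = [[0.5, 0.5, 0.0, 0.0], [0.0, 1.0, 0.0, 0.2]]
chair_instance = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.0, 1.0, 0.2]]

leg_human = refine_joint_embedding(leg, human_class, human_instance)
leg_chair = refine_joint_embedding(leg, chair_class, chair_instance)

print("leg (human):", [round(x, 3) for x in leg_human])
print("leg (chair):", [round(x, 3) for x in leg_chair])
```

In this toy setup the same static vector ends up in different regions of the embedding space for the two categories, which is the property a static text embedding alone cannot provide. A real system would use learned projections and multi-head attention over image and text features rather than hand-written vectors.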