CapeNext: Rethinking and refining dynamic support information for category-agnostic pose estimation

📅 2025-11-17
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Static joint embeddings in category-agnostic pose estimation face two key challenges: cross-category semantic ambiguity (e.g., “leg” looks drastically different on humans and on furniture) and insufficient fine-grained instance discrimination (e.g., joint representations are confounded across cats with different poses or fur colors). To address these, we propose a dynamic support information modeling framework that combines textual semantic priors with category-level and instance-level visual cues from images. It uses hierarchical cross-modal interaction and a dual-stream feature refinement mechanism, which supports multi-granularity cross-modal alignment, to enhance joint embeddings dynamically. On the MP-100 benchmark, the method outperforms prior state-of-the-art approaches by a large margin, regardless of the network backbone.

📝 Abstract
Recent research in Category-Agnostic Pose Estimation (CAPE) has adopted fixed textual keypoint descriptions as a semantic prior for two-stage pose matching frameworks. While this paradigm enhances robustness and flexibility by removing the dependency on support images, our critical analysis reveals two inherent limitations of static joint embeddings: (1) polysemy-induced cross-category ambiguity during the matching process (e.g., the concept "leg" exhibits divergent visual manifestations across humans and furniture), and (2) insufficient discriminability for fine-grained intra-category variations (e.g., posture and fur discrepancies between a sleeping white cat and a standing black cat). To overcome these challenges, we propose a new framework that integrates hierarchical cross-modal interaction with dual-stream feature refinement, enhancing the joint embedding with both class-level and instance-specific cues from textual descriptions and specific images. Experiments on the MP-100 dataset demonstrate that, regardless of the network backbone, CapeNext consistently outperforms state-of-the-art CAPE methods by a large margin.
Problem

Research questions and friction points this paper is trying to address.

Addresses polysemy-induced ambiguity in cross-category pose matching
Improves discriminability for fine-grained intra-category variations
Overcomes limitations of static joint embedding in pose estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical cross-modal interaction for pose estimation
Dual-stream feature refinement for joint embedding
Class-level and instance-specific cues integration
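
To make the idea of refining a static joint embedding with class-level and instance-specific cues more concrete, here is a minimal sketch in plain Python. It is not the CapeNext architecture, whose details are not given on this page. It assumes, purely for illustration, that each cue is gathered by single-head dot-product attention and added back to the static embedding with fixed weights. The names `cross_attend`, `refine_joint_embedding`, `alpha`, and `beta`, and all toy vectors, are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(query, tokens):
    """Single-head attention: average the tokens, weighted by similarity to the query."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, t) / scale for t in tokens])
    return [sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(len(query))]


def refine_joint_embedding(static_emb, class_tokens, instance_tokens,
                           alpha=0.5, beta=0.5):
    """Hypothetical two-stream refinement (an assumption, not the paper's method).

    Stream 1 pulls a class-level cue, stream 2 an instance-level cue;
    both are added to the static text embedding as a residual.
    """
    class_cue = cross_attend(static_emb, class_tokens)
    instance_cue = cross_attend(static_emb, instance_tokens)
    return [s + alpha * c + beta * i
            for s, c, i in zip(static_emb, class_cue, instance_cue)]


# Toy data: one static embedding for the word "leg", shared by every category.
leg = [1.0, 0.0, 0.0, 0.0]
human_class = [[0.0, 1.0, 0.0, 0.0]]   # made-up class-level token for "human"
chair_class = [[0.0, 0.0, 1.0, 0.0]]   # made-up class-level token for "chair"
image_tokens = [[0.0, 0.0, 0.0, 1.0],  # made-up instance-level image features
                [0.5, 0.0, 0.0, 0.5]]

human_leg = refine_joint_embedding(leg, human_class, image_tokens)
chair_leg = refine_joint_embedding(leg, chair_class, image_tokens)
```

In this toy setup the same static "leg" vector yields different refined embeddings for the human and the chair, which is the kind of disambiguation a dynamic embedding is meant to provide. A real model would learn the projections and fusion weights rather than fix them by hand.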