UniHPR: Unified Human Pose Representation via Singular Value Contrastive Learning

📅 2025-10-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of unified representation across multimodal human pose modalities—RGB images, 2D keypoints, and 3D skeletons—as well as insufficient modeling of cross-modal correlations. We propose the first cross-modal unified representation framework based on singular value contrastive learning. Methodologically, we design a novel singular value decomposition (SVD)-guided contrastive loss that jointly aligns and enforces consistency among features from all three modalities in a shared latent space. Our key contribution is the first incorporation of singular value spectra into cross-modal pose alignment, explicitly capturing structural correlations across modalities. Experiments demonstrate state-of-the-art performance: 49.9 mm MPJPE on Human3.6M, 51.6 mm PA-MPJPE on 3DPW, and 9.24 mm retrieval error for bidirectional 2D↔3D pose retrieval—substantially advancing multimodal human pose understanding and generation.

📝 Abstract
In recent years, there has been growing interest in developing effective alignment pipelines that generate unified representations from different modalities for multi-modal fusion and generation. As an important component of human-centric applications, human pose representations are critical in many downstream tasks, such as Human Pose Estimation, Action Recognition, Human-Computer Interaction, and Object Tracking. Human pose representations or embeddings can be extracted from images, 2D keypoints, 3D skeletons, mesh models, and many other modalities. Yet the correlations among all of these representations have rarely been studied under a contrastive paradigm. In this paper, we propose UniHPR, a unified Human Pose Representation learning pipeline, which aligns human pose embeddings from images, 2D poses, and 3D poses. To align more than two data representations at the same time, we propose a novel singular value-based contrastive learning loss, which better aligns the different modalities and further boosts performance. To evaluate the effectiveness of the aligned representation, we choose 2D and 3D Human Pose Estimation (HPE) as our evaluation tasks. In our evaluation, with a simple 3D human pose decoder, UniHPR achieves remarkable performance: 49.9 mm MPJPE on the Human3.6M dataset and 51.6 mm PA-MPJPE on the 3DPW dataset under cross-domain evaluation. Meanwhile, our unified human pose representations enable 2D and 3D pose retrieval on the Human3.6M dataset, with a retrieval error of 9.24 mm in MPJPE.
Problem

Research questions and friction points this paper is trying to address.

Aligning human pose representations from multiple modalities
Developing contrastive learning for cross-modal pose alignment
Improving 2D/3D pose estimation and retrieval performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified pose representation learning from multiple modalities
Novel singular value-based contrastive learning loss
Alignment of image, 2D and 3D human pose embeddings
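The paper does not spell out the loss formula on this page, but the core idea, aligning three modality embeddings at once through the singular value spectrum, can be sketched as follows. If the three normalized embeddings of one sample are stacked into a 3×D matrix, that matrix has rank one (a single non-zero singular value) exactly when all modalities agree, so penalizing the trailing singular values pulls the modalities together. The function name and the specific penalty below are our illustrative assumptions, not the authors' formulation:

```python
import torch

def svd_alignment_loss(img_emb, pose2d_emb, pose3d_emb):
    """Illustrative singular value-based alignment loss (assumed form).

    Each argument is a (B, D) batch of embeddings from one modality.
    """
    # Stack per-sample embeddings from the three modalities: (B, 3, D)
    feats = torch.stack([img_emb, pose2d_emb, pose3d_emb], dim=1)
    feats = torch.nn.functional.normalize(feats, dim=-1)
    # Singular values of each sample's 3xD modality matrix, descending: (B, 3)
    s = torch.linalg.svdvals(feats)
    # Rank-one (perfectly aligned) matrices have zero trailing singular
    # values, so penalizing them enforces cross-modal consistency.
    return (s[:, 1:] ** 2).mean()
```

With identical embeddings across modalities the loss is zero, and it grows as the three embeddings diverge, which is the qualitative behavior a joint three-way alignment objective needs; the actual UniHPR loss may weight or combine the spectrum differently.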
Authors
Zhongyu Jiang (Apple Inc.)
Wenhao Chai (Princeton University)
Lei Li (University of Copenhagen)
Zhuoran Zhou (University of Washington)
Cheng-Yen Yang (University of Washington)
Jenq-Neng Hwang (University of Washington)