AI Summary
Existing lifting-based 3D human pose estimation (HPE) models exhibit limited generalization across datasets and in real-world scenarios. To address this, we propose AugLift, a novel framework that, for the first time, incorporates keypoint confidence scores and monocular depth estimates as sparse auxiliary signals into the standard lifting pipeline. This extends the conventional 2D input representation from $(x, y)$ to a four-dimensional $(x, y, c, d)$ format, enabling seamless integration with mainstream lifting architectures without architectural changes. The method is fully modular and relies solely on off-the-shelf, pre-trained models to generate the auxiliary signals. Evaluated on four benchmark datasets, AugLift reduces cross-dataset mean error by 10.1% and improves in-distribution performance by 4.0%, demonstrating substantial gains in robustness and generalization. AugLift thus offers a lightweight, plug-and-play route to better 3D HPE generalization.
Abstract
Lifting-based methods for 3D Human Pose Estimation (HPE), which predict 3D poses from detected 2D keypoints, often generalize poorly to new datasets and real-world settings. To address this, we propose \emph{AugLift}, a simple yet effective reformulation of the standard lifting pipeline that significantly improves generalization performance without requiring additional data collection or sensors. AugLift sparsely enriches the standard input -- the 2D keypoint coordinates $(x, y)$ -- by augmenting it with a keypoint detection confidence score $c$ and a corresponding depth estimate $d$. These additional signals are computed from the image using off-the-shelf, pre-trained models (e.g., for monocular depth estimation), thereby inheriting their strong generalization capabilities. Importantly, AugLift serves as a modular add-on and can be readily integrated into existing lifting architectures.
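To make the input reformulation concrete, the following is a minimal sketch of how the enriched $(x, y, c, d)$ representation could be assembled from a 2D detector's output and a monocular depth map. The function name, array shapes, and the nearest-pixel depth sampling are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def augment_keypoints(kpts_2d, conf, depth_map):
    """Assemble an AugLift-style (x, y, c, d) input (illustrative sketch).

    kpts_2d:   (J, 2) pixel coordinates from an off-the-shelf 2D detector
    conf:      (J,) per-keypoint detection confidence scores
    depth_map: (H, W) monocular depth estimate for the same image
    """
    h, w = depth_map.shape
    # Sample the dense depth map sparsely, at each keypoint location
    # (rounded to the nearest pixel and clipped to the image bounds).
    xs = np.clip(np.round(kpts_2d[:, 0]).astype(int), 0, w - 1)
    ys = np.clip(np.round(kpts_2d[:, 1]).astype(int), 0, h - 1)
    d = depth_map[ys, xs]
    # Concatenate coordinates, confidence, and depth into a (J, 4) array
    # that any lifting architecture can consume in place of (J, 2) input.
    return np.concatenate([kpts_2d, conf[:, None], d[:, None]], axis=1)
```

Because the auxiliary signals stay aligned with the keypoints, the only change a lifting model needs is an input dimension of 4 instead of 2 per joint.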
Our extensive experiments across four datasets demonstrate that AugLift boosts cross-dataset performance on unseen datasets by an average of $10.1\%$, while also improving in-distribution performance by $4.0\%$. These gains are consistent across various lifting architectures, highlighting the robustness of our method. Our analysis suggests that these sparse, keypoint-aligned cues provide robust frame-level context, offering a practical way to significantly improve the generalization of any lifting-based pose estimation model. Code will be made publicly available.