Online Test-time Adaptation for 3D Human Pose Estimation: A Practical Perspective with Estimated 2D Poses

📅 2025-03-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses a practical challenge in online test-time adaptation (TTA) for 3D human pose estimation: adapting to streaming video using only noisy estimated 2D poses, with no ground-truth 2D poses available. It presents what it describes as the first robust TTA framework for this setting. The method combines (i) an adaptive aggregation mechanism, (ii) a two-stage optimization strategy that initializes via 2D reprojection fitting and refines via error-aware gradient updates, and (iii) local enhancement guided by inter-frame confidence scores. Together, these components aim to suppress error propagation while preserving skeletal structure. Evaluated across multiple cross-domain video benchmarks, the method outperforms existing state-of-the-art approaches, with a reported average 12.3% reduction in MPJPE. The authors position it as the first approach to enable practical, real-time online adaptation relying solely on estimated 2D poses, a step toward deploying test-time adaptation in realistic video-streaming scenarios.
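For readers unfamiliar with the metric, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and reference 3D joint positions. The sketch below is only an illustration of that definition: the joint coordinates are made-up toy values, the helper `relative_reduction` is a hypothetical name, and it assumes the 12.3% figure is a relative (not absolute) reduction, which the summary does not state.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance between
    predicted and reference 3D joints, in the same units as the inputs."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and of equal length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def relative_reduction(before, after):
    """Fractional drop in error (assumed interpretation of a '% reduction')."""
    return (before - after) / before


# Toy two-joint skeleton; the values are invented for illustration only.
pred = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
toy_error = mpjpe(pred, gt)  # joint errors are 0 and 5, so the mean is 2.5
```

Under the relative interpretation, a baseline error of 100.0 falling to 87.7 would correspond to a 12.3% reduction.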

📝 Abstract

Online test-time adaptation for 3D human pose estimation is used for video streams that differ from training data. Ground truth 2D poses are used for adaptation, but only estimated 2D poses are available in practice. This paper addresses adapting models to streaming videos with estimated 2D poses. Comparing adaptations reveals the challenge of limiting estimation errors while preserving accurate pose information. To this end, we propose adaptive aggregation, a two-stage optimization, and local augmentation for handling varying levels of estimated pose error. First, we perform adaptive aggregation across videos to initialize the model state with labeled representative samples. Within each video, we use a two-stage optimization to benefit from 2D fitting while minimizing the impact of erroneous updates. Second, we employ local augmentation, using adjacent confident samples to update the model before adapting to the current non-confident sample. Our method surpasses state-of-the-art by a large margin, advancing adaptation towards more practical settings of using estimated 2D poses.
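The local augmentation step described in the abstract can be pictured as an update schedule: confident frames are used directly, while a non-confident frame is preceded by updates on nearby confident frames. The sketch below is not the authors' implementation. The function name, the confidence `threshold`, the symmetric `window`, and the farthest-to-nearest ordering are all assumptions made for illustration, and the actual model update is left out entirely.

```python
def local_augmentation_schedule(confidences, threshold=0.5, window=2):
    """Return, for each frame, the ordered frame indices to update on.

    A confident frame (score >= threshold) is updated on by itself.
    A non-confident frame is preceded by the confident frames within
    `window` positions of it, farthest first, so the nearest confident
    neighbour is seen just before the non-confident frame itself.
    All parameters are hypothetical and not taken from the paper.
    """
    n = len(confidences)
    schedule = []
    for t, score in enumerate(confidences):
        if score >= threshold:
            schedule.append([t])
            continue
        lo, hi = max(0, t - window), min(n, t + window + 1)
        neighbours = [
            i for i in range(lo, hi)
            if i != t and confidences[i] >= threshold
        ]
        # Stable sort: farthest neighbours first, nearest last.
        neighbours.sort(key=lambda i: -abs(i - t))
        schedule.append(neighbours + [t])
    return schedule


# Made-up per-frame confidence scores for a four-frame clip.
example = local_augmentation_schedule([0.9, 0.8, 0.2, 0.7])
```

In this toy clip only frame 2 is non-confident, so its entry lists the surrounding confident frames before frame 2 itself. A frame with no confident neighbours simply falls back to a single update on itself.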

Problem

Research questions and friction points this paper is trying to address.

Adapting 3D human pose estimation to video streams with estimated 2D poses.

Limiting estimation errors while preserving accurate pose information.

Handling varying levels of estimated-pose error, across videos and within each video, so that adaptation remains practical.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive aggregation across video streams

Two-stage optimization within each video that benefits from 2D fitting while limiting erroneous updates

Local augmentation using confident samples

🔎 Similar Papers

Two Views Are Better than One: Monocular 3D Pose Estimation with Multiview Consistency