🤖 AI Summary
This work addresses the challenge of online 3D human pose estimation and tracking in crowded monocular video sequences. Methodologically, it departs from the conventional detection-then-matching paradigm by introducing a novel frame-to-frame pose propagation mechanism, augmented by a learnable, image-guided propagation module for occlusion-robust temporal modeling. To enhance generalization—particularly under severe occlusions—it employs cross-dataset pseudo-label co-training and self-supervised pseudo-labeling. Experiments demonstrate state-of-the-art accuracy in 3D pose estimation and significant improvements in multi-object tracking metrics: reduced ID switches, enhanced trajectory completeness, and improved stability—all while enabling real-time inference. Key contributions are: (1) the first end-to-end online 3D pose tracking framework eliminating explicit detection and association; (2) an image-guided, learnable pose propagation paradigm; and (3) empirical validation that cross-domain pseudo-label co-training substantially improves robustness in occluded scenarios.
📝 Abstract
We introduce an approach for detecting and tracking detailed 3D poses of multiple people from a single monocular camera stream. Our system maintains temporally coherent predictions in crowded scenes with difficult poses and occlusions. Our model performs both strong per-frame detection and a learned pose update to track people from frame to frame. Rather than matching detections across time, poses are updated directly from the new input image, which enables online tracking through occlusion. We train on numerous image and video datasets, leveraging pseudo-labeled annotations to produce a model that matches state-of-the-art systems in 3D pose estimation accuracy while being faster and more accurate when tracking multiple people through time. Code and weights are provided at https://github.com/apple/ml-comotion.
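The core idea of replacing detect-then-match with direct, image-driven pose updates can be illustrated with a minimal conceptual sketch. All names here (`PoseTracker`, `update_poses`, `detect_new`) are hypothetical and do not reflect the actual CoMotion API; the learned update is stood in for by a trivial placeholder.

```python
import numpy as np

NUM_JOINTS = 17  # e.g. a COCO-style skeleton; purely illustrative

class PoseTracker:
    """Conceptual sketch: tracks persist via pose updates, not re-association."""

    def __init__(self):
        self.tracks = {}   # track_id -> (NUM_JOINTS, 3) array of 3D joints
        self.next_id = 0

    def update_poses(self, image_features):
        # Stand-in for the learned, image-guided pose update: each existing
        # track is refined directly from the new frame's features, so no
        # detection-to-track matching step is ever needed.
        for tid, pose in self.tracks.items():
            self.tracks[tid] = pose + 0.01 * image_features.mean()

    def detect_new(self, detections):
        # Per-frame detection only spawns tracks for newly appeared people;
        # existing identities are carried forward by update_poses above.
        for pose in detections:
            self.tracks[self.next_id] = pose
            self.next_id += 1

    def step(self, image_features, detections):
        self.update_poses(image_features)  # propagate existing tracks online
        self.detect_new(detections)        # add new people entering the scene
        return self.tracks
```

Because identities are carried by the update itself, a person who is briefly occluded (and thus undetected) keeps the same track ID once the pose update re-locks onto image evidence, which is what reduces ID switches relative to matching-based pipelines.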