CoMotion: Concurrent Multi-person 3D Motion

📅 2025-04-16
🤖 AI Summary
This work addresses online 3D human pose estimation and tracking of multiple people in crowded monocular video. Rather than matching detections across frames, the model combines per-frame detection with a learned, image-guided pose update that propagates each person's pose from one frame to the next, which allows tracking to continue through occlusion. To improve generalization, it is trained on numerous image and video datasets using pseudo-labeled annotations. The authors report that the resulting model matches state-of-the-art systems in 3D pose estimation accuracy while tracking multiple people through time faster and more accurately. Key contributions are: (1) an online multi-person 3D pose tracking framework that avoids explicit association of detections across time; (2) a learned pose update conditioned on the new input image; and (3) training with pseudo-labeled annotations across multiple image and video datasets.

📝 Abstract
We introduce an approach for detecting and tracking detailed 3D poses of multiple people from a single monocular camera stream. Our system maintains temporally coherent predictions in crowded scenes filled with difficult poses and occlusions. Our model performs both strong per-frame detection and a learned pose update to track people from frame to frame. Rather than match detections across time, poses are updated directly from a new input image, which enables online tracking through occlusion. We train on numerous image and video datasets leveraging pseudo-labeled annotations to produce a model that matches state-of-the-art systems in 3D pose estimation accuracy while being faster and more accurate in tracking multiple people through time. Code and weights are provided at https://github.com/apple/ml-comotion

Problem

Research questions and friction points this paper is trying to address.

Detect and track detailed 3D poses of multiple people from a single monocular camera stream
Maintain temporally coherent predictions in crowded scenes with difficult poses and occlusions
Track people online through occlusion by updating poses directly from each new frame

Innovation

Methods, ideas, or system contributions that make the work stand out.

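Multi-person 3D pose tracking from a single monocular camera stream
Learned pose update, applied directly from each new image, instead of matching detections across time
Training on numerous image and video datasets with pseudo-labeled annotations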
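
The update-instead-of-match idea can be illustrated with a minimal sketch. This is not the CoMotion implementation: `OnlineTracker`, `Track`, and the three callables (`detect`, `update`, `is_same`) are hypothetical names introduced here, poses are reduced to one-element lists, and track removal is omitted. The sketch only shows the control flow the abstract describes: existing tracks are updated directly from the new image, and per-frame detections are used solely to start tracks for people who are not yet covered.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Pose = List[float]  # placeholder pose representation (assumption, not the paper's)


@dataclass
class Track:
    track_id: int
    pose: Pose


class OnlineTracker:
    """Hypothetical online loop: update tracked poses first, then spawn new tracks."""

    def __init__(
        self,
        detect: Callable[[list], List[Pose]],   # stand-in for per-frame detection
        update: Callable[[Pose, list], Pose],   # stand-in for the learned pose update
        is_same: Callable[[Pose, Pose], bool],  # does a detection duplicate a track?
    ) -> None:
        self.detect = detect
        self.update = update
        self.is_same = is_same
        self.tracks: Dict[int, Track] = {}
        self._next_id = 0

    def step(self, image: list) -> Dict[int, Pose]:
        # 1. Update every existing track directly from the new image.
        #    No detection-to-track matching happens, so identities persist.
        for track in self.tracks.values():
            track.pose = self.update(track.pose, image)
        # 2. Detections only create tracks for people not already tracked.
        for det in self.detect(image):
            covered = any(self.is_same(det, t.pose) for t in self.tracks.values())
            if not covered:
                self.tracks[self._next_id] = Track(self._next_id, det)
                self._next_id += 1
        return {tid: list(t.pose) for tid, t in self.tracks.items()}


# Toy stand-ins: an "image" is a list of visible 1D positions.
def toy_detect(image: list) -> List[Pose]:
    return [[x] for x in image]


def toy_update(pose: Pose, image: list, radius: float = 2.0) -> Pose:
    # Move to the nearest visible position; keep the old pose when occluded.
    if not image:
        return pose
    nearest = min(image, key=lambda x: abs(x - pose[0]))
    return [nearest] if abs(nearest - pose[0]) <= radius else pose


def toy_is_same(a: Pose, b: Pose) -> bool:
    return abs(a[0] - b[0]) < 1.0


def run_demo() -> Dict[int, Pose]:
    tracker = OnlineTracker(toy_detect, toy_update, toy_is_same)
    # The second person is hidden in the third frame and reappears in the fourth.
    frames = [[0.0, 10.0], [0.5, 10.5], [1.0], [1.5, 11.0]]
    result: Dict[int, Pose] = {}
    for frame in frames:
        result = tracker.step(frame)
    return result
```

Calling `run_demo()` runs the tracker over four toy frames in which the second person is hidden in the third frame. Because that track keeps its last pose while hidden and is updated again once the person reappears, it retains its original ID rather than being re-created from a fresh detection.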