🤖 AI Summary
Existing markerless 4D human motion capture methods degrade significantly in complex real-world scenarios such as multi-person interactions, severe occlusions, and rapid position exchanges, primarily because high-quality training and evaluation data are lacking. To address this gap, this work introduces a new dataset and benchmark designed for challenging markerless 4D human motion capture. It covers single-person and multi-person scenarios with frequent inter-person occlusions, rapid position exchanges between similarly dressed subjects, and varying subject distances. The dataset provides synchronized multi-view RGB and depth sequences, accurate camera calibration, ground-truth 3D motion captured by a Vicon system, and corresponding SMPL/SMPL-X parameters. Benchmark evaluations reveal substantial performance drops in state-of-the-art methods under these conditions, while targeted fine-tuning improves generalization, confirming that the dataset is both challenging and practically useful.
📝 Abstract
Marker-based motion capture (MoCap) systems have long been the gold standard for accurate 4D human modeling, but their reliance on specialized hardware and markers limits scalability and real-world deployment. Advancing reliable markerless 4D human motion capture requires datasets that reflect the complexity of real-world human interactions. However, existing benchmarks often lack realistic multi-person dynamics, severe occlusions, and challenging interaction patterns, leading to a persistent domain gap. In this work, we present a new dataset and evaluation benchmark for complex markerless 4D human motion capture. Our MoCap dataset captures both single-person and multi-person scenarios with intricate motions, frequent inter-person occlusions, rapid position exchanges between similarly dressed subjects, and varying subject distances. It includes synchronized multi-view RGB and depth sequences, accurate camera calibration, ground-truth 3D motion capture from a Vicon system, and corresponding SMPL/SMPL-X parameters. This setup ensures precise alignment between visual observations and motion ground truth. Benchmarking state-of-the-art markerless MoCap models reveals substantial performance degradation under these realistic conditions, highlighting the limitations of current approaches. We further demonstrate that targeted fine-tuning improves generalization, validating the dataset's realism and value for model development. Our evaluation exposes critical gaps in existing models and provides a rigorous foundation for advancing robust markerless 4D human motion capture.