🤖 AI Summary
This work addresses ambiguity in multi-person 3D pose estimation for combat sports, where rapid motions, severe occlusions, and close interpersonal contact must be resolved from only a sparse set of camera views. We propose a physics-aware joint optimization framework that integrates Transformer-based multi-view 2D pose tracking, epipolar geometry constraints, long-term video object segmentation, and weighted triangulation. Crucially, we introduce the first joint multi-person physics-based trajectory optimization, which incorporates kinematic constraints and rigid-body dynamics modeling to enforce spatiotemporal consistency and physical plausibility; spline-based smoothing and physics-informed refinement further improve robustness. Our approach achieves state-of-the-art performance on a newly established elite boxing benchmark and on multiple public datasets. To support further research, we release a high-quality, manually annotated dataset.
📝 Abstract
We propose a novel framework for accurate 3D human pose estimation in combat sports using sparse multi-camera setups. Our method integrates robust multi-view 2D pose tracking via a transformer-based top-down approach, employing epipolar geometry constraints and long-term video object segmentation for consistent identity tracking across views. Initial 3D poses are obtained through weighted triangulation and spline smoothing, followed by kinematic optimization to refine pose accuracy. We further enhance pose realism and robustness by introducing a multi-person physics-based trajectory optimization step, effectively addressing challenges such as rapid motions, occlusions, and close interactions. Experimental results on diverse datasets, including a new benchmark of elite boxing footage, demonstrate state-of-the-art performance. Additionally, we release comprehensive annotated video datasets to advance future research in multi-person pose estimation for combat sports.
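As an illustration of the weighted triangulation step mentioned above, the following is a minimal sketch of confidence-weighted linear (DLT) triangulation for a single joint. It is not the authors' implementation: the function names, the toy three-camera rig, and the use of per-view 2D detection confidences as weights are all assumptions made for this example.

```python
import numpy as np


def triangulate_weighted(projections, points_2d, weights):
    """Confidence-weighted linear (DLT) triangulation of one joint.

    projections: sequence of 3x4 camera projection matrices
    points_2d:   sequence of (x, y) pixel detections, one per camera
    weights:     per-view weights (assumed here to be 2D detector confidences)
    """
    rows = []
    for P, (x, y), w in zip(projections, points_2d, weights):
        P = np.asarray(P, dtype=float)
        # Each view contributes two linear constraints on the homogeneous
        # 3D point; scaling them by w controls that view's influence.
        rows.append(w * (x * P[2] - P[0]))
        rows.append(w * (y * P[2] - P[1]))
    A = np.stack(rows)
    # The least-squares solution is the right singular vector associated
    # with the smallest singular value of A.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]


def make_camera(K, R, t):
    """Build a 3x4 projection matrix P = K [R | t]."""
    return K @ np.hstack([R, np.reshape(t, (3, 1))])


def project(P, X):
    """Project a 3D point to pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]


# Hypothetical sparse rig: three cameras sharing one intrinsic matrix.
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
c, s = np.cos(np.pi / 6), np.sin(np.pi / 6)
R_y = np.array([[c, 0.0, s],
                [0.0, 1.0, 0.0],
                [-s, 0.0, c]])
cams = [
    make_camera(K, np.eye(3), np.zeros(3)),
    make_camera(K, np.eye(3), np.array([-1.0, 0.0, 0.0])),
    make_camera(K, R_y, np.array([0.5, 0.0, 0.5])),
]

X_true = np.array([0.2, -0.1, 4.0])      # synthetic joint position
obs = [project(P, X_true) for P in cams]  # noise-free 2D detections

X_est = triangulate_weighted(cams, obs, [1.0, 1.0, 1.0])
```

In this sketch, lowering the weight of a view that is occluded or poorly detected reduces its pull on the reconstructed joint, which is the usual motivation for weighting views when athletes block each other from some cameras. The spline smoothing, kinematic optimization, and physics-based refinement described in the abstract would operate on the resulting per-frame 3D joints and are not covered here.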