🤖 AI Summary
This work addresses the failure of 2D ball-trajectory segmentation in monocular broadcast videos under occlusion and varied camera viewpoints by introducing a “lift-then-segment” paradigm. The approach first lifts the entire unsegmented 2D ball trajectory into 3D using a learned lifting network, then performs temporal segmentation and 4D (3D + time) reconstruction on the lifted trajectory. The pipeline combines 2D-to-3D trajectory lifting, 3D temporal segmentation, ball spin estimation, camera calibration, and 3D human mesh recovery; the authors state that it is the only method capable of reconstructing table tennis gameplay from general-view monocular broadcast video. The resulting dataset, TT4D, comprises over 140 hours of reconstructed singles and doubles gameplay, and its fidelity is demonstrated on two downstream tasks: estimating the racket's pose and velocity at impact, and training a generative model of competitive rallies.
📝 Abstract
We present TT4D, a large-scale, high-fidelity table tennis dataset. It provides 140+ hours of singles and doubles gameplay reconstructed from monocular broadcast videos, with multimodal annotations including high-quality camera calibrations, precise 3D ball positions, ball spin, time segmentation, and 3D human meshes over time. This rich data provides a new foundation for virtual replay, in-depth player analysis, and robot learning. The dataset's combination of scale and precision is achieved through a novel reconstruction pipeline. Prior methods first partition a game sequence into individual shot segments based on the 2D ball track, and only then attempt reconstruction. However, 2D-based time segmentation breaks down under occlusion and varied camera viewpoints, preventing reliable reconstruction. We invert this paradigm by first lifting the entire unsegmented 2D ball track to 3D with a learned lifting network. The resulting 3D trajectory then allows us to perform time segmentation reliably. The learned lifting network also infers the ball's spin, handles unreliable ball detections, and successfully reconstructs the ball trajectory in cases of high occlusion. This lift-first design is necessary: our pipeline is the only method capable of reconstructing table tennis gameplay from general-view monocular broadcast videos. We demonstrate the dataset's fidelity through two downstream tasks: estimating the racket's pose and velocity at impact, and training a generative model of competitive rallies.
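To illustrate why segmentation becomes simpler once a 3D track is available, here is a minimal toy sketch. It is not the TT4D pipeline. It assumes the 3D track has already been produced (the abstract attributes this step to a learned lifting network) and that coordinates are table-aligned, with x along the table's length and z the height above the table surface in metres. The function name `segment_3d_track`, the `bounce_height` threshold, and the sign-change rules are all hypothetical choices made for illustration.

```python
import numpy as np

def segment_3d_track(xyz, bounce_height=0.05):
    """Toy event detector on an already-lifted 3D ball track.

    xyz: (T, 3) array in an assumed table-aligned frame
         (x along table length, z height above the table, metres).
    Returns a list of (frame_index, label) tuples, where label is
    "hit" or "bounce". Hypothetical rules, not the paper's method.
    """
    xyz = np.asarray(xyz, dtype=float)
    vel = np.diff(xyz, axis=0)  # per-frame displacement, shape (T-1, 3)
    events = []
    for t in range(1, len(vel)):
        # Hit: motion along the table's length reverses direction.
        if vel[t - 1, 0] * vel[t, 0] < 0:
            events.append((t, "hit"))
        # Bounce: vertical motion flips from downward to upward
        # while the ball is close to the table surface.
        elif vel[t - 1, 2] < 0 and vel[t, 2] > 0 and xyz[t, 2] < bounce_height:
            events.append((t, "bounce"))
    return events

# Synthetic track: the ball bounces, is hit back, and bounces again.
track = np.array([
    [0.0, 0.0, 0.30],
    [1.0, 0.0, 0.15],
    [2.0, 0.0, 0.00],
    [3.0, 0.0, 0.15],
    [4.0, 0.0, 0.30],
    [3.0, 0.0, 0.15],
    [2.0, 0.0, 0.00],
    [1.0, 0.0, 0.15],
    [0.0, 0.0, 0.30],
])
events = segment_3d_track(track)
```

In a table-aligned 3D frame these rules do not depend on where the camera sits, whereas the same events seen in the 2D image change appearance with viewpoint and can be hidden by occlusion. That is the motivation the abstract gives for lifting before segmenting.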