🤖 AI Summary
Addressing the challenge of annotating point trajectories under complex human motion (non-rigid deformations, articulated joint movements, clothing dynamics, and frequent occlusions), this paper introduces AnthroTAP, an automated pipeline for generating high-fidelity pseudo-trajectory labels without any manual annotation. AnthroTAP fits the SMPL parametric human model to people detected in real video frames, projects the resulting 3D mesh vertices onto the 2D image plane to form pseudo-trajectories, models occlusions via ray casting, and filters unreliable tracks using optical flow consistency. A point tracker trained on the resulting dataset achieves state-of-the-art performance on the TAP-Vid benchmark, surpassing models trained exclusively on real video data, while using 10,000 times less data and converging in just one day on four GPUs. The approach thus delivers strong accuracy, computational efficiency, and scalability.
📝 Abstract
Human motion, with its inherent complexities such as non-rigid deformations, articulated movements, clothing distortions, and frequent occlusions caused by limbs or other individuals, provides a rich and challenging source of supervision that is crucial for training robust and generalizable point trackers. Despite this suitability, acquiring extensive training data for point tracking remains difficult due to laborious manual annotation. We address this with AnthroTAP, an automated pipeline that generates pseudo-labeled training data by leveraging the Skinned Multi-Person Linear (SMPL) model. We first fit the SMPL model to humans detected in video frames, project the resulting 3D mesh vertices onto 2D image planes to generate pseudo-trajectories, handle occlusions using ray casting, and filter out unreliable tracks based on optical flow consistency. A point tracking model trained on the AnthroTAP-annotated dataset achieves state-of-the-art performance on the TAP-Vid benchmark, surpassing models trained on real videos while using 10,000 times less data and only 1 day on 4 GPUs, compared to the 256 GPUs used by the recent state of the art.
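Two steps of the pipeline, projecting 3D mesh vertices onto the image plane and filtering tracks by optical flow consistency, can be sketched as below. This is a minimal illustration under assumed inputs (camera-space vertices, pinhole intrinsics `K`, a precomputed flow field sampled at each point); the function names and the pixel tolerance are illustrative, not taken from the paper.

```python
import numpy as np

def project_vertices(verts_3d, K):
    """Pinhole projection of (N, 3) camera-space vertices to (N, 2) pixels.

    For each vertex (x, y, z): u = fx * x / z + cx, v = fy * y / z + cy.
    """
    proj = verts_3d @ K.T            # (N, 3) homogeneous pixel coordinates
    return proj[:, :2] / proj[:, 2:3]

def flow_consistent(pts_t0, pts_t1, flow_at_t0, tol=2.0):
    """Keep tracks whose frame-to-frame displacement agrees with optical flow.

    pts_t0, pts_t1: (N, 2) projected point positions in consecutive frames.
    flow_at_t0:     (N, 2) flow vectors sampled at pts_t0.
    Returns a boolean mask; tracks with endpoint error above `tol` pixels
    are flagged as unreliable and dropped.
    """
    err = np.linalg.norm((pts_t1 - pts_t0) - flow_at_t0, axis=1)
    return err < tol

# Example: a simple 640x480 camera with focal length 500.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0,   0.0,   1.0]])
pts = project_vertices(np.array([[0.0, 0.0, 2.0],
                                 [0.2, 0.0, 2.0]]), K)
mask = flow_consistent(np.array([[0.0, 0.0]]),
                       np.array([[3.0, 0.0]]),
                       np.array([[3.0, 0.0]]))
```

A full implementation would additionally ray-cast from the camera center to each vertex against the SMPL mesh to mark occluded points, which this sketch omits.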