Automatic infant 2D pose estimation from videos: comparing seven deep neural network methods

📅 2024-06-25
🏛️ arXiv.org
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses markerless 2D pose estimation for infants in videos, both in supine position and in more complex settings, with emphasis on how well methods trained on adult data generalize. It benchmarks seven popular methods (AlphaPose, DeepLabCut, Detectron2, HRNet, MediaPipe, OpenPose, and ViTPose) on the same infant videos. Beyond standard average precision and recall, the evaluation adds three analyses: (1) errors expressed as a ratio of the neck-to-mid-hip (torso) length, (2) missed and redundant detections, and (3) the reliability of each method's internal confidence scores. ViTPose achieves the highest accuracy; all methods except DeepLabCut and MediaPipe perform competitively without fine-tuning; and among the competitive methods, only AlphaPose runs close to real time (27 fps on the authors' machine). For reproducibility, the authors provide documented Docker containers or instructions for every method on Docker Hub, and their analysis scripts and processed data on OSF.

📝 Abstract
Automatic markerless estimation of infant posture and motion from ordinary videos carries great potential for movement studies "in the wild", facilitating understanding of motor development and massively increasing the chances of early diagnosis of disorders. There is rapid development of human pose estimation methods in computer vision thanks to advances in deep learning and machine learning. However, these methods are trained on datasets that feature adults in different contexts. This work tests and compares seven popular methods (AlphaPose, DeepLabCut/DeeperCut, Detectron2, HRNet, MediaPipe/BlazePose, OpenPose, and ViTPose) on videos of infants in supine position and in more complex settings. Surprisingly, all methods except DeepLabCut and MediaPipe have competitive performance without additional finetuning, with ViTPose performing best. Next to standard performance metrics (average precision and recall), we introduce errors expressed in the neck-mid-hip (torso length) ratio and additionally study missed and redundant detections, and the reliability of the internal confidence ratings of the different methods, which are relevant for downstream tasks. Among the networks with competitive performance, only AlphaPose could run close to real time (27 fps) on our machine. We provide documented Docker containers or instructions for all the methods we used, our analysis scripts, and the processed data at https://hub.docker.com/u/humanoidsctu and https://osf.io/x465b/.
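The abstract describes errors expressed relative to the neck-mid-hip (torso) length, which makes a pixel error comparable across infants and camera distances. The following is a minimal sketch of that idea, not the authors' analysis script: the keypoint names (`neck`, `mid_hip`, `left_wrist`), the function names, and the choice to take the torso length from the ground-truth annotation are all assumptions made for illustration.

```python
import math

def torso_length(neck, mid_hip):
    """Euclidean distance between the neck and mid-hip keypoints, in pixels."""
    return math.dist(neck, mid_hip)

def normalized_errors(predicted, ground_truth):
    """Return each keypoint's error divided by the ground-truth torso length.

    Both arguments map keypoint names to (x, y) pixel coordinates.
    The names "neck" and "mid_hip" are hypothetical labels for this sketch.
    """
    torso = torso_length(ground_truth["neck"], ground_truth["mid_hip"])
    if torso == 0:
        raise ValueError("torso length is zero; cannot normalize")
    return {
        name: math.dist(predicted[name], ground_truth[name]) / torso
        for name in ground_truth
        if name in predicted  # keypoints with no prediction are skipped here
    }

# Made-up coordinates for one frame; the torso is 100 px long.
gt = {"neck": (100.0, 50.0), "mid_hip": (100.0, 150.0), "left_wrist": (60.0, 90.0)}
pred = {"neck": (100.0, 50.0), "mid_hip": (103.0, 154.0), "left_wrist": (60.0, 110.0)}
errors = normalized_errors(pred, gt)
```

In this made-up frame a 20-pixel wrist error becomes 0.2 torso lengths, a figure that keeps the same meaning if the video is rescaled.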
Problem

Research questions and friction points this paper is trying to address.

Evaluating infant 2D pose estimation from videos using deep learning
Comparing seven methods for accuracy in infant posture and motion
Assessing performance metrics and real-time applicability for infant studies
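The abstract also mentions studying missed and redundant detections alongside precision and recall. The sketch below shows one simplified way such counts could be tallied per frame. It is an assumption-laden illustration, not the paper's procedure: it treats a keypoint as detected when its confidence reaches a hypothetical `min_confidence` threshold, ignores localization accuracy entirely, and the function and keypoint names are invented for this example.

```python
def count_detection_errors(annotated, detected, min_confidence=0.5):
    """Count matched, missed, and redundant keypoints in one frame.

    annotated: set of keypoint names labeled by a human in this frame.
    detected:  dict mapping keypoint name -> confidence score from a model.
    min_confidence: hypothetical cutoff below which a detection is discarded.
    """
    kept = {name for name, score in detected.items() if score >= min_confidence}
    return {
        "matched": len(annotated & kept),    # labeled and detected
        "missed": len(annotated - kept),     # labeled but not detected
        "redundant": len(kept - annotated),  # detected but not labeled
    }

def precision_recall(counts):
    """Presence-only precision and recall derived from the counts above."""
    found = counts["matched"] + counts["redundant"]
    labeled = counts["matched"] + counts["missed"]
    precision = counts["matched"] / found if found else 0.0
    recall = counts["matched"] / labeled if labeled else 0.0
    return precision, recall

# Made-up frame: one low-confidence wrist and one unlabeled ankle detection.
annotated = {"nose", "left_wrist", "right_wrist", "left_ankle"}
detected = {"nose": 0.9, "left_wrist": 0.8, "right_wrist": 0.2, "right_ankle": 0.7}
counts = count_detection_errors(annotated, detected)
precision, recall = precision_recall(counts)
```

Because the outcome depends on the threshold, a count like this is only as trustworthy as the confidence scores themselves, which is why their reliability matters for downstream tasks.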
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compares seven deep learning human pose estimation methods on infant videos
Introduces error metrics normalized by torso length (neck to mid-hip distance)
Provides Docker containers for method replication
Filipe Gama
Czech Technical University in Prague, Faculty of Electrical Engineering, Department of Cybernetics, Prague, Czech Republic
Matej Misar
Czech Technical University in Prague, Faculty of Electrical Engineering, Department of Cybernetics, Prague, Czech Republic
Lukas Navara
Czech Technical University in Prague, Faculty of Electrical Engineering, Department of Cybernetics, Prague, Czech Republic
S. T. Popescu
Czech Technical University in Prague, Faculty of Electrical Engineering, Department of Cybernetics, Prague, Czech Republic
Matej Hoffmann
Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University in Prague
cognitive developmental robotics, body representations, peripersonal space, collaborative robots, human-robot interaction