🤖 AI Summary
This work presents AIGaitor, which the authors describe as, to their knowledge, the first monocular system to run markerless motion capture and downstream deep-learning analysis end to end on a consumer smartphone, with no cloud infrastructure. It targets the cost, technical complexity, and privacy concerns that limit clinical adoption of motion analysis. The pipeline combines 2D and 3D pose estimation, pose optimization, skeleton-based action recognition, and a vision-language model, all running on the phone's neural accelerators. On an iPhone 14, the Time-Priority pipeline processes a 10 s 4K 60 fps clip in 77 s; lightweight models such as ViTPose-s extract keypoints in real time, and skeleton-based action-recognition models classify gait in under a millisecond. With network transfer included, the same pipeline on an NVIDIA H200 cloud server takes 94 s at global mobile-average uplink and 66 s at developed-world Wi-Fi, so the on-device route is faster on mobile networks and within 11 s on fast Wi-Fi, while keeping video on the device and avoiding cloud costs.
📝 Abstract
Motion capture is the gold standard for measuring human movement, but clinical use remains limited by cost, technical complexity, and privacy concerns. AIGaitor is a privacy-preserving, cloud-free motion analysis system that runs markerless monocular motion-capture pipelines and downstream deep-learning analysis entirely on a consumer smartphone using on-device neural accelerators. To motivate its design, we surveyed 74 rehabilitation clinicians: 92 percent said they would adopt an accurate, cost-effective, easy-to-use AI gait analysis tool, while 79.7 percent cited operating cost, 68.9 percent insufficient training, and 64.9 percent privacy concerns as leading barriers. We then optimized and benchmarked mobile iOS implementations of current monocular pipeline components, including 2D and 3D pose estimation, pose optimization, skeleton-based deep-learning analysis, and a vision-language model. A Time-Priority end-to-end on-device pipeline processes a 10 s 4K 60 fps video clip in 77 s on an iPhone 14, matching or beating the same pipeline on a high-end NVIDIA H200 cloud server when network transfer is included: the cloud route takes 94 s at global mobile-average uplink and 66 s at developed-world Wi-Fi. Lightweight models such as ViTPose-s achieve real-time keypoint extraction, and skeleton-based action-recognition models provide sub-millisecond gait classification on the same clip. To our knowledge, AIGaitor is the first monocular system to demonstrate end-to-end on-device motion capture and downstream deep-learning analysis, supporting clinically applicable movement analysis that is low-cost, private, and accessible to smartphone users.
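The timing figures in the abstract can be put on a common scale with a few lines of arithmetic. The sketch below is illustrative only: it uses just the numbers quoted above, and it assumes that the 94 s and 66 s figures are end-to-end totals for the cloud route (upload plus H200 processing). The variable and function names are hypothetical and do not come from the paper.

```python
# Figures quoted in the abstract.
CLIP_SECONDS = 10          # length of the test clip
FPS = 60                   # capture frame rate
ON_DEVICE_SECONDS = 77     # iPhone 14, Time-Priority pipeline

# Assumption: these are total times for the cloud route, network transfer included.
CLOUD_SECONDS = {
    "global mobile-average uplink": 94,
    "developed-world Wi-Fi": 66,
}

frames = CLIP_SECONDS * FPS                         # frames in the clip
realtime_factor = ON_DEVICE_SECONDS / CLIP_SECONDS  # processing time per second of video
throughput_fps = frames / ON_DEVICE_SECONDS         # end-to-end frames processed per second


def cloud_margin(cloud_seconds: float) -> float:
    """Seconds saved by on-device processing (negative means the cloud route is faster)."""
    return cloud_seconds - ON_DEVICE_SECONDS


margins = {name: cloud_margin(t) for name, t in CLOUD_SECONDS.items()}

print(f"{frames} frames, {realtime_factor:.1f}x clip duration, {throughput_fps:.2f} frames/s")
for name, margin in margins.items():
    print(f"{name}: on-device margin {margin:+d} s")
```

Under these assumptions the end-to-end pipeline runs at 7.7 times the clip duration, so the real-time claim in the abstract applies to keypoint extraction with lightweight models rather than to the full pipeline. The on-device route saves 17 s against the mobile-uplink case and trails the Wi-Fi case by 11 s.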