🤖 AI Summary
This study addresses the lack of systematic evaluation of how well video foundation models perform across tasks in remote Parkinson’s disease screening. For the first time, it comprehensively assesses seven prominent video foundation models, including VideoPrism, V-JEPA, and ViViT, on a large-scale real-world clinical video dataset, using frozen embeddings with linear classification heads across multiple clinical tasks. The results show area under the curve (AUC) scores ranging from 76.4% to 85.3% and specificity as high as 90.3%, but sensitivity remains relatively low (43.2–57.3%). These findings show that the most informative tasks depend strongly on the model architecture, offering guidance for model selection and future optimization in remote neurological disease monitoring.
📝 Abstract
Remote, video-based assessments offer a scalable pathway for Parkinson's disease (PD) screening. While traditional approaches rely on handcrafted features mimicking clinical scales, recent advances in video foundation models (VFMs) enable representation learning without task-specific customization. However, the comparative effectiveness of different VFM architectures across diverse clinical tasks remains poorly understood. We present a large-scale systematic study using a novel video dataset from 1,888 participants (727 with PD), comprising 32,847 videos across 16 standardized clinical tasks. We evaluate seven state-of-the-art VFMs, including VideoPrism, V-JEPA, ViViT, and VideoMAE, to determine their robustness in clinical screening. By evaluating frozen embeddings with a linear classification head, we demonstrate that task saliency is highly model-dependent: VideoPrism excels in capturing visual speech kinematics (no audio) and facial expressivity, while V-JEPA proves superior for upper-limb motor tasks. Notably, TimeSformer remains highly competitive for rhythmic tasks like finger tapping. Our experiments yield AUCs of 76.4-85.3% and accuracies of 71.5-80.6%. While high specificity (up to 90.3%) suggests strong potential for correctly identifying individuals without PD, the lower sensitivity (43.2-57.3%) highlights the need for task-aware calibration and integration of multiple tasks and modalities. Overall, this work establishes a rigorous baseline for VFM-based PD screening and provides a roadmap for selecting suitable tasks and architectures in remote neurological monitoring. Code and anonymized structured data are publicly available: https://anonymous.4open.science/r/parkinson_video_benchmarking-A2C5
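The gap between specificity and sensitivity reported above depends on the decision threshold applied to the classifier's scores. The following is a minimal sketch, not the authors' code, of how the two metrics are computed and how lowering the threshold trades specificity for sensitivity. The scores and labels are made up for illustration, and the helper names `sens_spec` and `threshold_for_sensitivity` are hypothetical.

```python
def sens_spec(scores, labels, threshold):
    """Return (sensitivity, specificity) when score >= threshold is called PD."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s >= threshold)
    fn = sum(1 for s, y in pairs if y == 1 and s < threshold)
    tn = sum(1 for s, y in pairs if y == 0 and s < threshold)
    fp = sum(1 for s, y in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def threshold_for_sensitivity(scores, labels, target):
    """Return the highest observed score whose sensitivity reaches `target`."""
    for t in sorted(set(scores), reverse=True):
        sens, _ = sens_spec(scores, labels, t)
        if sens >= target:
            return t
    return min(scores)


# Hypothetical classifier scores (illustrative only, not from the study).
# Label 1 = PD, label 0 = no PD.
scores = [0.9, 0.7, 0.45, 0.3, 0.2, 0.6, 0.4, 0.35, 0.1, 0.05]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Default cut-off of 0.5: few false alarms, but many PD cases are missed.
default = sens_spec(scores, labels, 0.5)

# Lower the cut-off until at least 80% of PD cases are flagged.
t = threshold_for_sensitivity(scores, labels, 0.8)
calibrated = sens_spec(scores, labels, t)

print("default:", default)
print("calibrated threshold:", t, "->", calibrated)
```

In practice the threshold would be chosen on held-out validation data, separately for each task, rather than on the data used to report the final metrics.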