Benchmarking Video Foundation Models for Remote Parkinson's Disease Screening

📅 2026-02-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the current lack of systematic evaluation of video foundation models for cross-task effectiveness in remote Parkinson’s disease screening. For the first time, it comprehensively assesses seven prominent video foundation models—including VideoPrism, V-JEPA, and ViViT—on a large-scale real-world clinical video dataset, employing a frozen-embedding paradigm with linear classification heads across multiple clinical tasks. The results demonstrate area under the curve (AUC) scores ranging from 76.4% to 85.3%, with specificity as high as 90.3%, yet sensitivity remains relatively low (43.2–57.3%). These findings reveal a strong dependency between task performance and model architecture, offering critical guidance for model selection and future optimization in remote neurological disease monitoring.
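
A minimal sketch of the frozen-embedding protocol summarized above: a frozen video foundation model turns each clip into one vector, and only a linear classification head is trained on top. The `backbone.embed()` interface and the data handling are hypothetical placeholders, not the paper's released code.

```python
# Minimal sketch of the frozen-embedding + linear-probe evaluation described above.
# The `backbone.embed(video) -> np.ndarray` interface is an assumed placeholder.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def extract_frozen_embeddings(videos, backbone):
    """Run the frozen backbone and pool one embedding vector per clip."""
    return np.stack([backbone.embed(v) for v in videos])

def linear_probe_auc(train_videos, y_train, test_videos, y_test, backbone):
    X_train = extract_frozen_embeddings(train_videos, backbone)
    X_test = extract_frozen_embeddings(test_videos, backbone)
    clf = LogisticRegression(max_iter=1000)   # linear classification head
    clf.fit(X_train, y_train)                 # only the head is trained; backbone stays frozen
    scores = clf.predict_proba(X_test)[:, 1]  # predicted probability of PD
    return roc_auc_score(y_test, scores)
```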

📝 Abstract
Remote, video-based assessments offer a scalable pathway for Parkinson's disease (PD) screening. While traditional approaches rely on handcrafted features mimicking clinical scales, recent advances in video foundation models (VFMs) enable representation learning without task-specific customization. However, the comparative effectiveness of different VFM architectures across diverse clinical tasks remains poorly understood. We present a large-scale systematic study using a novel video dataset from 1,888 participants (727 with PD), comprising 32,847 videos across 16 standardized clinical tasks. We evaluate seven state-of-the-art VFMs, including VideoPrism, V-JEPA, ViViT, and VideoMAE, to determine their robustness in clinical screening. By evaluating frozen embeddings with a linear classification head, we demonstrate that task saliency is highly model-dependent: VideoPrism excels in capturing visual speech kinematics (no audio) and facial expressivity, while V-JEPA proves superior for upper-limb motor tasks. Notably, TimeSformer remains highly competitive for rhythmic tasks like finger tapping. Our experiments yield AUCs of 76.4–85.3% and accuracies of 71.5–80.6%. While high specificity (up to 90.3%) suggests strong potential for ruling out healthy individuals, the lower sensitivity (43.2–57.3%) highlights the need for task-aware calibration and integration of multiple tasks and modalities. Overall, this work establishes a rigorous baseline for VFM-based PD screening and provides a roadmap for selecting suitable tasks and architectures in remote neurological monitoring. Code and anonymized structured data are publicly available: https://anonymous.4open.science/r/parkinson_video_benchmarking-A2C5
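
The reported screening numbers combine AUC with sensitivity and specificity at a decision threshold. The sketch below, using hypothetical labels and scores rather than the paper's data, shows how these metrics are computed and how shifting the threshold is one simple form of the task-aware calibration the abstract calls for.

```python
# Sketch of the screening metrics reported above (AUC, sensitivity, specificity)
# and of the sensitivity/specificity trade-off controlled by the decision threshold.
# `y_true` and `scores` are hypothetical per-participant labels and PD probabilities.
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

def screening_metrics(y_true, scores, threshold=0.5):
    y_pred = (np.asarray(scores) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "auc": roc_auc_score(y_true, scores),
        "sensitivity": tp / (tp + fn),   # fraction of PD participants correctly flagged
        "specificity": tn / (tn + fp),   # fraction of healthy participants correctly cleared
    }

# Lowering the threshold raises sensitivity at the cost of specificity:
# print(screening_metrics(y_true, scores, threshold=0.5))
# print(screening_metrics(y_true, scores, threshold=0.3))
```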
Problem

Research questions and friction points this paper is trying to address.

Parkinson's disease
video foundation models
remote screening
clinical assessment
benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Video Foundation Models
Parkinson's Disease Screening
Remote Assessment
Model Benchmarking
Clinical Video Analysis
🔎 Similar Papers
No similar papers found.
Md Saiful Islam
Computer Science, University of Rochester
digital health, health AI, multimodal machine learning, wearables, NLP
Ekram Hossain
Professor, University of Manitoba, Canada, IEEE Fellow
Wireless communication networks, radio resource allocation, cognitive radio, multi-tier cellular networks
Abdelrahman Abdelkader
University of Rochester
Human-Computer Interaction
Tariq Adnan
PhD Student, Department of CSE, University of Rochester
Health Analytics, Big Data Analytics, Cloud Computing, Distributed Systems, Social Networks
Fazla Rabbi Mashrur
University of Rochester, Rochester, NY, USA
Sooyong Park
University of Rochester, Rochester, NY, USA
Praveen Kumar
University of Rochester, Rochester, NY, USA
Qasim Sudais
University of Rochester, Rochester, NY, USA
Natalia Chunga
Louisiana State University Health Sciences Center at Shreveport, USA
Nami Shah
University of Rochester Medical Center, Rochester, NY, USA
Jan Freyberg
Google DeepMind, London, UK
Christopher Kanan
University of Rochester
Artificial Intelligence, Deep Learning, AGI, Multi-Modal AI, Cognitive Science
Ruth Schneider
University of Rochester Medical Center, Rochester, NY, USA
Ehsan Hoque
Professor of Computer Science, University of Rochester
affective computing, computer vision, speech processing, autism