🤖 AI Summary
This study investigates the zero-shot transfer capability of vision-language models (VLMs) for stroke rehabilitation video analysis, addressing two key clinical challenges: automated quantification of rehabilitation dose and assessment of functional impairment severity. We propose a fine-tuning-free framework that combines prompt optimization with post-processing to enable high-level activity classification, motion and grasp detection, and dose estimation. On a cohort of 29 healthy controls and 51 stroke survivors, our approach achieves ≤25% dose-estimation error for mildly impaired and healthy subjects without any task-specific training; however, fine-grained motor understanding remains limited: dose estimates are comparable to a baseline that excludes visual information, and impairment scores cannot be reliably predicted. To our knowledge, this is the first work to empirically evaluate whether off-the-shelf VLMs can meaningfully interpret clinical rehabilitation videos. Our results highlight both the current limitations and the emerging promise of VLMs for low-cost, generalizable, and scalable rehabilitation monitoring in real-world clinical settings.
📝 Abstract
Vision-language models (VLMs) have demonstrated remarkable performance across a wide range of computer-vision tasks, sparking interest in their potential for digital health applications. Here, we apply VLMs to two fundamental challenges in data-driven stroke rehabilitation: automatic quantification of rehabilitation dose and impairment from videos. We cast both problems as motion-identification tasks amenable to VLMs. We evaluate our proposed framework on a cohort of 29 healthy controls and 51 stroke survivors. Our results show that current VLMs lack the fine-grained motion understanding required for precise quantification: dose estimates are comparable to a baseline that excludes visual information, and impairment scores cannot be reliably predicted. Nevertheless, several findings suggest future promise. With optimized prompting and post-processing, VLMs can classify high-level activities from a few frames, detect motion and grasp with moderate accuracy, and approximate dose counts within 25% of ground truth for mildly impaired and healthy participants, all without task-specific training or fine-tuning. These results highlight both the current limitations and emerging opportunities of VLMs for data-driven stroke rehabilitation and broader clinical video analysis.
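To illustrate the kind of post-processing the abstract alludes to, the sketch below turns noisy per-frame VLM grasp detections into a repetition (dose) count. The label names, the majority-vote smoothing window, and the rising-edge counting rule are all illustrative assumptions, not the paper's actual pipeline.

```python
# Hypothetical post-processing sketch: per-frame VLM outputs -> dose count.
# Labels, window size, and counting rule are illustrative assumptions.

def smooth_labels(frame_labels, active="grasp", window=3):
    """Suppress single-frame VLM flicker: a frame is labeled active only
    if a strict majority of frames in its surrounding window agree."""
    smoothed = []
    n = len(frame_labels)
    for i in range(n):
        lo, hi = max(0, i - window // 2), min(n, i + window // 2 + 1)
        votes = frame_labels[lo:hi]
        is_active = sum(v == active for v in votes) * 2 > len(votes)
        smoothed.append(active if is_active else "no-" + active)
    return smoothed

def count_repetitions(frame_labels, active="grasp"):
    """Count rising edges (inactive -> active) as one repetition each."""
    reps, prev_active = 0, False
    for label in frame_labels:
        is_active = (label == active)
        if is_active and not prev_active:
            reps += 1
        prev_active = is_active
    return reps

# Two true grasp events plus one single-frame false positive at index 4.
frames = ["no-grasp", "grasp", "grasp", "no-grasp", "grasp", "no-grasp",
          "no-grasp", "grasp", "grasp", "grasp", "no-grasp"]
print(count_repetitions(frames))                 # raw count inflated: 3
print(count_repetitions(smooth_labels(frames)))  # after smoothing: 2
```

The smoothing step matters because frame-level VLM predictions are often unstable; without it, every flickered frame would be counted as an extra repetition, inflating the dose estimate.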