The Potential and Limitations of Vision-Language Models for Human Motion Understanding: A Case Study in Data-Driven Stroke Rehabilitation

📅 2025-11-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the zero-shot transfer capability of vision-language models (VLMs) for stroke-rehabilitation video analysis, addressing two key clinical challenges: automated quantification of rehabilitation dose and assessment of impairment severity. The authors propose a fine-tuning-free framework that combines prompt optimization with post-processing to support high-level activity classification, motion/grasp detection, and dose estimation. On a cohort of healthy controls and stroke survivors, the approach approximates dose counts to within 25% of ground truth for mildly impaired and healthy subjects; however, fine-grained motor understanding remains limited: dose estimates are comparable to a baseline without visual information, and impairment scores cannot be reliably predicted. The work is an early empirical assessment of whether off-the-shelf VLMs, without task-specific training, can support low-cost, generalizable, and scalable rehabilitation monitoring in real-world clinical settings.

📝 Abstract
Vision-language models (VLMs) have demonstrated remarkable performance across a wide range of computer-vision tasks, sparking interest in their potential for digital health applications. Here, we apply VLMs to two fundamental challenges in data-driven stroke rehabilitation: automatic quantification of rehabilitation dose and impairment from videos. We formulate these problems as motion-identification tasks, which can be addressed using VLMs. We evaluate our proposed framework on a cohort of 29 healthy controls and 51 stroke survivors. Our results show that current VLMs lack the fine-grained motion understanding required for precise quantification: dose estimates are comparable to a baseline that excludes visual information, and impairment scores cannot be reliably predicted. Nevertheless, several findings suggest future promise. With optimized prompting and post-processing, VLMs can classify high-level activities from a few frames, detect motion and grasp with moderate accuracy, and approximate dose counts within 25% of ground truth for mildly impaired and healthy participants, all without task-specific training or fine-tuning. These results highlight both the current limitations and emerging opportunities of VLMs for data-driven stroke rehabilitation and broader clinical video analysis.
Problem

Research questions and friction points this paper is trying to address.

Evaluating VLMs' capability to quantify rehabilitation dose from videos
Assessing VLMs' ability to measure movement impairment in stroke patients
Testing VLMs' motion understanding for clinical video analysis without fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

VLMs classify high-level activities from few frames
VLMs detect motion and grasp with moderate accuracy
VLMs approximate dose counts without task-specific training
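The paper does not publish code, but the dose-estimation idea above can be illustrated with a minimal post-processing sketch: assuming a VLM has already been prompted to label each sampled frame as "move" or "rest" (the labels, threshold, and function name here are illustrative, not from the paper), repetition count can be approximated by counting sustained movement bouts and discarding single-frame flickers as noise.

```python
def estimate_dose(frame_labels, min_bout=2):
    """Approximate a repetition (dose) count from per-frame VLM labels.

    A bout is a run of at least `min_bout` consecutive "move" labels;
    shorter runs are treated as labeling noise and ignored.
    """
    count = 0
    run = 0
    for label in frame_labels:
        if label == "move":
            run += 1
        else:
            if run >= min_bout:
                count += 1
            run = 0
    # Close out a bout that runs to the end of the video.
    if run >= min_bout:
        count += 1
    return count


# Hypothetical per-frame labels for a short clip: two sustained bouts,
# plus one isolated "move" frame that is filtered out as noise.
labels = ["rest", "move", "move", "rest", "move",
          "rest", "move", "move", "move", "rest"]
print(estimate_dose(labels))  # 2
```

In practice the choice of `min_bout` trades off sensitivity to brief movements against robustness to frame-level misclassifications, which matters most for the more impaired participants where, per the abstract, estimates remain unreliable.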