🤖 AI Summary
This study addresses the limitations of expert-dependent assessment in clinical simulation training (time intensity, poor scalability, and inconsistent scoring) with a three-stage framework that extracts action timelines from egocentric nursing simulation videos, derives sequence-level features and per-session recognition metrics, and relates both to instructor-rated competency. Using a frozen DINOv2 visual encoder, few-shot learning, and HMM Viterbi decoding, the method achieves 57.4% mean-over-frames (MOF) accuracy in leave-one-out 1-shot action recognition across 22 densely annotated sessions. A key finding is a significant negative correlation between recognition accuracy and trainee competency (ρ = -0.524, p = 0.012 for mIoU): more competent learners produce more diverse, more protocol-consistent workflows that are harder to classify. The correlation holds under six confound controls, which suggests that recognition difficulty itself may serve as a signal for instructional feedback.
📝 Abstract
Assessing learner competency in clinical simulation requires expert observation that is time-intensive, difficult to scale, and subject to inter-rater variability. Vision-language models have emerged as a promising tool for understanding complex visual behavior. In this work, we investigate whether visual observations can provide educationally meaningful signals for competency assessment through a three-stage framework that (1) extracts action timelines from egocentric nursing simulation video using frozen visual encoders and few-shot learning, (2) derives sequence-level features and per-session recognition metrics, and (3) relates these to instructor-rated competency. Across 22 densely annotated sessions (3.8 hours, 493 actions), a frozen DINOv2 backbone with HMM Viterbi decoding achieves 57.4% MOF in leave-one-out 1-shot recognition. Surprisingly, we observe a negative trend between recognition accuracy and competency (rho = -0.524, p = 0.012 for mIoU), robust to six confound controls: more competent students produce diverse, harder-to-classify workflows, while simple sequence features show no such relationship. Per-item analysis identifies patient safety protocols and team communication as the expected behaviors most reflected in this pattern, and process model comparisons reveal that higher-competency students exhibit more protocol-consistent action transitions. These findings suggest that recognition accuracy may complement predicted action timelines as a pedagogically informative signal in automated competency assessment.
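As a rough illustration of the decoding and scoring steps named in the abstract, the sketch below runs Viterbi decoding over per-frame class probabilities and scores the result with MOF, read here as frame-wise accuracy (mean over frames). It is a minimal sketch under stated assumptions, not the authors' implementation: the "sticky" transition matrix, the `stay_prob` value, the uniform prior, and the toy two-class probabilities are all hypothetical. In the described pipeline the per-frame scores would presumably come from frozen DINOv2 features compared against few-shot examples, which is not reproduced here.

```python
import math


def viterbi(log_emissions, log_trans, log_prior):
    """Return the most likely state path for per-frame log emission scores."""
    n_states = len(log_prior)
    # score[s] = best log-probability of any path ending in state s
    score = [log_prior[s] + log_emissions[0][s] for s in range(n_states)]
    backpointers = []
    for frame in log_emissions[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + log_trans[p][s])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s] + frame[s])
        score = new_score
        backpointers.append(pointers)
    # Backtrack from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def sticky_transitions(n_states, stay_prob=0.9):
    """Hypothetical transition matrix that favors staying in the same action."""
    switch = (1.0 - stay_prob) / (n_states - 1)
    return [[math.log(stay_prob if i == j else switch) for j in range(n_states)]
            for i in range(n_states)]


def mean_over_frames(pred, truth):
    """MOF: fraction of frames whose predicted label matches the annotation."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


# Toy data (assumed): two actions, eight frames, one spurious flicker at frame 2.
truth = [0, 0, 0, 0, 1, 1, 1, 1]
probs = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.9, 0.1],
         [0.2, 0.8], [0.1, 0.9], [0.2, 0.8], [0.1, 0.9]]

log_emissions = [[math.log(p) for p in frame] for frame in probs]
log_prior = [math.log(0.5), math.log(0.5)]

raw = [max(range(2), key=lambda s: frame[s]) for frame in probs]
smoothed = viterbi(log_emissions, sticky_transitions(2), log_prior)

print("per-frame argmax MOF:", mean_over_frames(raw, truth))   # 0.875
print("Viterbi MOF:", mean_over_frames(smoothed, truth))       # 1.0
```

In this toy case the transition prior removes the one-frame flicker that per-frame argmax leaves in. Per-session scores of this kind are what the abstract then relates to instructor-rated competency.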