🤖 AI Summary
This work addresses the need for real-time detection of human struggle (i.e., operational difficulty) in intelligent assistive systems. We propose the first online struggle anticipation framework, enabling streaming prediction of struggle events up to two seconds before they occur. Methodologically, we design a lightweight, feature-based pipeline compatible with mainstream vision backbones: the feature-based models run at up to 143 FPS, and the full pipeline, including feature extraction, at around 20 FPS, sufficient for real-time use. Our key contributions are threefold: (1) moving beyond conventional offline classification, we formulate struggle recognition as an online detection and anticipation problem; (2) we examine generalization across tasks and activities and analyse the impact of skill evolution, with models outperforming random baselines by 4–20% mAP despite larger domain gaps at the activity level; and (3) we attain 70–80% per-frame mAP in online detection, with struggle anticipation up to two seconds ahead showing only slight drops, meeting the dual practical requirements of low latency and cross-activity generalizability in assistive applications.
📝 Abstract
Understanding human skill performance is essential for intelligent assistive systems, with struggle recognition offering a natural cue for identifying user difficulties. While prior work focuses on offline struggle classification and localization, real-time applications require models capable of detecting and anticipating struggle online. We reformulate struggle localization as an online detection task and further extend it to anticipation, predicting struggle moments before they occur. We adapt two off-the-shelf models as baselines for online struggle detection and anticipation. Online struggle detection achieves 70-80% per-frame mAP, while struggle anticipation up to 2 seconds ahead yields comparable performance with slight drops. We further examine generalization across tasks and activities and analyse the impact of skill evolution. Despite larger domain gaps in activity-level generalization, models still outperform random baselines by 4-20%. Our feature-based models run at up to 143 FPS, and the whole pipeline, including feature extraction, operates at around 20 FPS, sufficient for real-time assistive applications.
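The streaming setup described in the abstract (per-frame feature extraction feeding an online head that scores both the current frame and a horizon up to 2 s ahead) can be sketched as a simple loop. Everything below is an illustrative assumption, not the authors' actual models: the toy feature extractor stands in for a frozen vision backbone, and the scoring head is a placeholder for the adapted online detection/anticipation baselines.

```python
from collections import deque
import numpy as np

WINDOW = 32          # frames of temporal context kept in memory (assumed)
FPS = 20             # full-pipeline rate reported in the abstract
HORIZON = 2 * FPS    # anticipate up to 2 s ahead, in frames

def extract_features(frame):
    # Stand-in for a vision backbone (e.g. a frozen CNN/ViT encoder).
    # A toy per-pixel channel mean keeps the sketch runnable.
    return np.asarray(frame, dtype=np.float32).mean(axis=-1, keepdims=True)

def detect_and_anticipate(window):
    # Stand-in for the temporal head: returns (current struggle score,
    # anticipated struggle score HORIZON frames ahead), both in [0, 1].
    feats = np.concatenate(window)
    score_now = float(1.0 / (1.0 + np.exp(-feats.mean())))
    # Anticipation weighted toward the most recent features (toy choice).
    score_future = float(1.0 / (1.0 + np.exp(-feats[-HORIZON:].mean())))
    return score_now, score_future

def stream(frames):
    # Online inference: frames arrive one at a time; no access to the future.
    buf = deque(maxlen=WINDOW)
    scores = []
    for frame in frames:
        buf.append(extract_features(frame))
        scores.append(detect_and_anticipate(list(buf)))
    return scores
```

The key design point the sketch illustrates is causality: unlike offline localization, the model only ever sees frames up to the present, which is what makes both online detection and anticipation possible in a live assistive system.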