🤖 AI Summary
This study systematically evaluates a vision foundation model (V-JEPA) and a multimodal vision-language model (Gemini Flash 2.0) against a skeleton-based model (HD-GCN) for full-body gesture recognition in human–robot interaction under the dynamic, noisy conditions of agile manufacturing. Experiments are conducted on the NUGGET dataset, designed specifically for human–machine communication, using multimodal inputs (RGB images, skeletal sequences, and textual prompts) and zero-shot protocols. Key contributions are: (1) the first empirical validation of V-JEPA as a lightweight, shared multi-task backbone for gesture recognition, reaching 97% of HD-GCN's accuracy with only a simple linear classifier and thereby substantially reducing architectural complexity and inference latency; and (2) evidence of critical limitations in current multimodal large language models (e.g., Gemini) for text-only zero-shot gesture classification, underscoring input representation design as a fundamental bottleneck. The results inform model selection and architectural simplification for low-latency, robust nonverbal interaction in industrial settings.
📝 Abstract
Gestures enable non-verbal human–robot communication, especially in noisy environments such as agile production. Traditional deep-learning-based gesture recognition relies on task-specific architectures that take images, videos, or skeletal pose estimates as input. Meanwhile, Vision Foundation Models (VFMs) and Vision Language Models (VLMs), with their strong generalization abilities, offer the potential to reduce system complexity by replacing dedicated task-specific modules. This study investigates adapting such models for dynamic, full-body gesture recognition, comparing V-JEPA (a state-of-the-art VFM), Gemini Flash 2.0 (a multimodal VLM), and HD-GCN (a top-performing skeleton-based approach). We introduce NUGGET, a dataset tailored to human–robot communication in intralogistics environments, to evaluate the different gesture recognition approaches. In our experiments, HD-GCN achieves the best performance, but V-JEPA comes close with a simple, task-specific classification head, paving a possible way toward reducing system complexity by using it as a shared multi-task model. In contrast, Gemini struggles to differentiate gestures based solely on textual descriptions in the zero-shot setting, highlighting the need for further research on suitable input representations for gestures.
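The "simple, task-specific classification head" on top of a frozen backbone amounts to a linear probe: backbone features are kept fixed and only a softmax linear layer is trained. A minimal NumPy sketch of this idea is below; the feature dimension, class count, and synthetic "embeddings" are illustrative assumptions standing in for frozen V-JEPA features, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for frozen backbone features: clustered random
# vectors, one cluster per gesture class. Sizes are illustrative only.
num_classes, feat_dim, n = 6, 64, 300
centers = rng.normal(size=(num_classes, feat_dim))
labels = rng.integers(0, num_classes, size=n)
feats = centers[labels] + 0.1 * rng.normal(size=(n, feat_dim))

# Linear classification head trained with softmax cross-entropy;
# the backbone itself is never updated.
W = np.zeros((feat_dim, num_classes))
b = np.zeros(num_classes)
lr = 0.5
for _ in range(200):
    logits = feats @ W + b
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    probs[np.arange(n), labels] -= 1.0            # dL/dlogits for cross-entropy
    W -= lr * feats.T @ probs / n
    b -= lr * probs.mean(axis=0)

preds = (feats @ W + b).argmax(axis=1)
accuracy = (preds == labels).mean()
```

Because only `W` and `b` are learned, the probe adds negligible parameters and latency on top of the shared backbone, which is what makes the multi-task-backbone argument attractive.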