🤖 AI Summary
This study systematically evaluates a vision foundation model (V-JEPA) and a multimodal vision-language model (Gemini Flash 2.0) against a skeleton-based model (HD-GCN) for full-body gesture recognition in human–robot interaction under the dynamic, noisy conditions of agile manufacturing. Experiments are conducted on the NUGGET dataset, designed specifically for human–machine communication, using multimodal inputs (RGB images, skeletal sequences, and textual prompts) and zero-shot protocols. Key contributions are: (1) the first empirical validation of V-JEPA as a lightweight, shared multi-task backbone for gesture recognition, reaching 97% of HD-GCN's accuracy with only a simple linear classifier and thereby substantially reducing architectural complexity and inference latency; and (2) evidence of critical limitations in current multimodal large language models (e.g., Gemini) for text-only zero-shot gesture classification, underscoring input representation design as a fundamental bottleneck. The results inform model selection and architectural simplification for low-latency, robust nonverbal interaction in industrial settings.
📝 Abstract
Gestures enable non-verbal human–robot communication, especially in noisy environments such as agile production. Traditional deep-learning-based gesture recognition relies on task-specific architectures that take images, videos, or skeletal pose estimates as input. Meanwhile, Vision Foundation Models (VFMs) and Vision Language Models (VLMs), with their strong generalization abilities, offer the potential to reduce system complexity by replacing dedicated task-specific modules. This study investigates adapting such models for dynamic, full-body gesture recognition, comparing V-JEPA (a state-of-the-art VFM), Gemini Flash 2.0 (a multimodal VLM), and HD-GCN (a top-performing skeleton-based approach). We introduce NUGGET, a dataset tailored to human–robot communication in intralogistics environments, to evaluate the different gesture recognition approaches. In our experiments, HD-GCN achieves the best performance, but V-JEPA comes close with a simple, task-specific classification head, paving a possible way toward reducing system complexity by using it as a shared multi-task model. In contrast, Gemini struggles to differentiate gestures based solely on textual descriptions in the zero-shot setting, highlighting the need for further research on suitable input representations for gestures.
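The "simple, task-specific classification head" on top of a frozen backbone amounts to a linear probe: backbone features are kept fixed and only a softmax linear layer is trained. A minimal NumPy sketch of this idea is below; the feature dimension, class count, and synthetic "embeddings" are illustrative assumptions standing in for frozen V-JEPA features, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for frozen backbone features: clustered random
# vectors, one cluster per gesture class. Sizes are illustrative only.
num_classes, feat_dim, n = 6, 64, 300
centers = rng.normal(size=(num_classes, feat_dim))
labels = rng.integers(0, num_classes, size=n)
feats = centers[labels] + 0.1 * rng.normal(size=(n, feat_dim))

# Linear classification head trained with softmax cross-entropy;
# the backbone itself is never updated.
W = np.zeros((feat_dim, num_classes))
b = np.zeros(num_classes)
lr = 0.5
for _ in range(200):
    logits = feats @ W + b
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    probs[np.arange(n), labels] -= 1.0            # dL/dlogits for cross-entropy
    W -= lr * feats.T @ probs / n
    b -= lr * probs.mean(axis=0)

preds = (feats @ W + b).argmax(axis=1)
accuracy = (preds == labels).mean()
```

Because only `W` and `b` are learned, the probe adds negligible parameters and latency on top of the shared backbone, which is what makes the multi-task-backbone argument attractive.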