How do Foundation Models Compare to Skeleton-Based Approaches for Gesture Recognition in Human-Robot Interaction?

📅 2025-06-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates visual foundation models (V-JEPA, Gemini Flash 2.0) against a skeleton-based model (HD-GCN) for full-body gesture recognition in human–robot interaction under the dynamic, noisy conditions of agile manufacturing. Experiments are conducted on the NUGGET dataset, which was designed specifically for human–machine communication, using multimodal inputs (RGB images, skeletal sequences, and textual prompts) and zero-shot protocols. Key contributions include: (1) the first empirical validation of V-JEPA as a lightweight, shared multi-task backbone for gesture recognition, achieving 97% of HD-GCN's accuracy with only a simple linear classifier and thereby substantially reducing architectural complexity and inference latency; and (2) the identification of critical limitations in current multimodal large language models (e.g., Gemini) for text-only zero-shot gesture classification, underscoring input representation design as a fundamental bottleneck. The results inform model selection and architectural simplification for low-latency, robust nonverbal interaction in industrial settings.

📝 Abstract
Gestures enable non-verbal human-robot communication, especially in noisy environments like agile production. Traditional deep learning-based gesture recognition relies on task-specific architectures using images, videos, or skeletal pose estimates as input. Meanwhile, Vision Foundation Models (VFMs) and Vision Language Models (VLMs), with their strong generalization abilities, offer the potential to reduce system complexity by replacing dedicated task-specific modules. This study investigates adapting such models for dynamic, full-body gesture recognition, comparing V-JEPA (a state-of-the-art VFM), Gemini Flash 2.0 (a multimodal VLM), and HD-GCN (a top-performing skeleton-based approach). We introduce NUGGET, a dataset tailored for human-robot communication in intralogistics environments, to evaluate the different gesture recognition approaches. In our experiments, HD-GCN achieves the best performance, but V-JEPA comes close with a simple, task-specific classification head, thus paving a possible way towards reducing system complexity by using it as a shared multi-task model. In contrast, Gemini struggles to differentiate gestures based solely on textual descriptions in the zero-shot setting, highlighting the need for further research on suitable input representations for gestures.
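The abstract's key engineering idea, keeping the foundation model frozen and attaching only a simple task-specific linear head, can be sketched as a linear probe over precomputed embeddings. Everything below is a hypothetical illustration: the feature dimension, the number of gesture classes, and the synthetic arrays standing in for frozen V-JEPA clip embeddings are assumptions, not details from the paper.

```python
import numpy as np

# Hypothetical setup: in practice X would hold frozen-backbone embeddings
# of gesture clips; here random data stands in for them.
rng = np.random.default_rng(0)
n_train, feat_dim, n_classes = 200, 64, 5  # assumed sizes, for illustration

X_train = rng.normal(size=(n_train, feat_dim))      # "frozen" features
y_train = rng.integers(0, n_classes, size=n_train)  # gesture labels

# One-hot targets; a closed-form ridge regression plays the role of the
# "simple linear classification head" trained on top of the frozen model.
Y = np.eye(n_classes)[y_train]
lam = 1e-2  # small ridge penalty for numerical stability
W = np.linalg.solve(X_train.T @ X_train + lam * np.eye(feat_dim), X_train.T @ Y)

def predict(X, W):
    """Argmax over linear scores: the entire task-specific head."""
    return (X @ W).argmax(axis=1)

train_acc = (predict(X_train, W) == y_train).mean()
```

The backbone never receives gradients in this scheme, which is what allows one shared model to serve multiple tasks, each with its own cheap linear head.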
Problem

Research questions and friction points this paper is trying to address.

Comparing Foundation Models and skeleton-based approaches for gesture recognition
Evaluating V-JEPA, Gemini Flash 2.0, and HD-GCN on dynamic full-body gestures
Investigating potential of Foundation Models to reduce system complexity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapting Vision Foundation Models for gesture recognition
Comparing V-JEPA, Gemini Flash 2.0, and HD-GCN
Introducing NUGGET dataset for human-robot communication
Stephanie Käs
Chair for Computer Vision, RWTH Aachen University, Germany
Anton Burenko
Chair for Computer Vision, RWTH Aachen University, Germany
Louis Markert
Chair for Computer Vision, RWTH Aachen University, Germany
Onur Alp Culha
Chair for Computer Vision, RWTH Aachen University, Germany
Dennis Mack
Robert Bosch GmbH, Corporate Research & Bosch Center for AI, Renningen and Hildesheim, Germany
Timm Linder
Research Scientist, 3D Robot Perception, Bosch Research
Computer Vision · 3D Scene Understanding · HRI · Robotics · Autonomous Systems
Bastian Leibe
Professor for Computer Vision, RWTH Aachen University
Computer Vision · Object Recognition · Tracking · Scene Understanding