🤖 AI Summary
Current vision-language models exhibit significant limitations in fine-grained hand spatial reasoning—such as estimating joint angles, distances, and relative positions—hindering their applicability in high-precision domains like surgery, manufacturing, and AR/VR. To address this gap, this work introduces HandVQA, the first large-scale, controllable visual question answering benchmark for hand-centric spatial understanding. HandVQA automatically generates over 1.6 million structured multiple-choice questions from high-quality 3D hand datasets including FreiHAND, InterHand2.6M, and FPHA. The authors conduct systematic evaluations by lightly fine-tuning prominent models such as LLaVA, DeepSeek, and Qwen-VL using LoRA. Experiments reveal pervasive issues including finger hallucination, geometric misjudgment, and poor generalization. Notably, models trained on HandVQA achieve performance gains of 10.33% and 2.63% on downstream gesture recognition and hand-object interaction tasks, respectively, demonstrating strong zero-shot transfer of 3D spatial knowledge.
📝 Abstract
Understanding the fine-grained articulation of human hands is critical in high-stakes settings such as robot-assisted surgery, chip manufacturing, and AR/VR-based human-AI interaction. Despite achieving near-human performance on general vision-language benchmarks, current vision-language models (VLMs) struggle with fine-grained spatial reasoning, especially when interpreting complex and articulated hand poses. We introduce HandVQA, a large-scale diagnostic benchmark designed to evaluate VLMs' understanding of detailed hand anatomy through visual question answering. Built upon high-quality 3D hand datasets (FreiHAND, InterHand2.6M, FPHA), our benchmark includes over 1.6M controlled multiple-choice questions that probe spatial relationships between hand joints, such as angles, distances, and relative positions. We evaluate several state-of-the-art VLMs (LLaVA, DeepSeek, and Qwen-VL) in both base and fine-tuned settings, using lightweight fine-tuning via LoRA. Our findings reveal systematic limitations in current models, including hallucinated finger parts, incorrect geometric interpretations, and poor generalization. HandVQA not only exposes these critical reasoning gaps but also provides a validated path to improvement. We demonstrate that the 3D-grounded spatial knowledge learned from our benchmark transfers in a zero-shot setting, significantly improving model accuracy on novel downstream tasks such as hand gesture recognition (+10.33%) and hand-object interaction (+2.63%).
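To make the question-generation idea concrete, the sketch below shows how a multiple-choice question about a joint angle could be derived automatically from 3D hand keypoints. This is an illustrative reconstruction, not the paper's actual pipeline: the joint naming, the question wording, and the fixed-offset distractor scheme are all assumptions for the sake of the example.

```python
import numpy as np

def joint_angle_deg(a, b, c):
    """Angle (degrees) at joint b formed by the 3D points a-b-c."""
    ba = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    bc = np.asarray(c, dtype=float) - np.asarray(b, dtype=float)
    cos = np.dot(ba, bc) / (np.linalg.norm(ba) * np.linalg.norm(bc))
    # Clip to guard against floating-point values slightly outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def make_mcq(joint_name, angle, step=15.0):
    """Wrap a measured angle in a 4-way multiple-choice question.
    The fixed-step distractor offsets are an illustrative choice, not
    HandVQA's documented strategy."""
    options = sorted(round(angle + k * step, 1) for k in (-1, 0, 1, 2))
    answer = round(angle, 1)
    return {
        "question": f"What is the flexion angle at the {joint_name} joint?",
        "options": [f"{o} degrees" for o in options],
        "answer_index": options.index(answer),
    }

# Example: three keypoints forming a right angle at a hypothetical index PIP joint.
a, b, c = [0.0, 1.0, 0.0], [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]
mcq = make_mcq("index PIP", joint_angle_deg(a, b, c))
```

Distance and relative-position questions would follow the same pattern, swapping the angle computation for a Euclidean distance or an axis comparison between keypoints.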