HanDyVQA: A Video QA Benchmark for Fine-Grained Hand-Object Interaction Dynamics

📅 2025-11-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing hand-object interaction (HOI) benchmarks do not test the fine-grained spatio-temporal reasoning needed to model dynamic hand-object interactions, particularly the object state and geometric changes induced by manipulation. Method: We introduce HanDyVQA, a multiple-choice video question-answering benchmark that covers both the manipulation and the effect aspects of HOI through six question types (Action, Process, Objects, Location, State Change, and Object Parts), with segmentation masks for the Objects and Object Parts questions enabling object/part-level evaluation. We evaluate recent video foundation models on the benchmark and additionally test integrating explicit HOI-related cues into visual features. Results: The best-performing model, Gemini-2.5-Pro, reaches only 73% average accuracy, far below human performance (97%), revealing bottlenecks in spatial relationship, motion, and part-level geometric understanding. Integrating explicit HOI cues improves performance. This work provides a benchmark and evaluation setup for fine-grained dynamic hand-object interaction understanding.
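The mask-based evaluation mentioned in the summary can be illustrated with a small sketch. The source does not say which segmentation metric is used, so the per-frame intersection-over-union (IoU) averaged over a clip shown here is an assumption, chosen only because it is a common measure in video object segmentation. The function names `mask_iou` and `mean_video_iou` and the toy masks are hypothetical.

```python
# Minimal sketch: IoU between predicted and reference binary masks.
# ASSUMPTION: the benchmark's actual mask metric is not stated in the source;
# IoU is used here purely for illustration.

def mask_iou(pred, ref):
    """IoU of two binary masks given as equally sized nested lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, ref_row in zip(pred, ref):
        for p, r in zip(pred_row, ref_row):
            if p and r:
                inter += 1
            if p or r:
                union += 1
    # Two empty masks agree perfectly; avoid dividing by zero.
    return 1.0 if union == 0 else inter / union


def mean_video_iou(pred_masks, ref_masks):
    """Average the per-frame IoU over all frames of one clip."""
    scores = [mask_iou(p, r) for p, r in zip(pred_masks, ref_masks)]
    return sum(scores) / len(scores)


# Toy two-frame clip with 2x2 masks (hypothetical data).
pred_clip = [[[1, 1], [0, 0]], [[0, 1], [0, 1]]]
ref_clip = [[[1, 0], [0, 0]], [[0, 1], [0, 1]]]
clip_score = mean_video_iou(pred_clip, ref_clip)
print(clip_score)
```

In the toy clip the first frame overlaps on one of two marked pixels and the second frame matches exactly, so the clip-level score sits between the two per-frame values.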

📝 Abstract

Hand-object interaction (HOI) inherently involves dynamics where human manipulations produce distinct spatio-temporal effects on objects. However, existing semantic HOI benchmarks focus either on manipulation or on the resulting effects at a coarse level, lacking the fine-grained spatio-temporal reasoning needed to capture the underlying dynamics in HOI. We introduce HanDyVQA, a fine-grained video question-answering benchmark that comprehensively covers both the manipulation and effect aspects of HOI. HanDyVQA comprises six complementary question types (Action, Process, Objects, Location, State Change, and Object Parts), totalling 11.1K multiple-choice QA pairs. The collected QA pairs require recognizing manipulation styles, hand/object motions, and part-level state changes. HanDyVQA also includes 10.3K segmentation masks for Objects and Object Parts questions, enabling the evaluation of object/part-level reasoning in video object segmentation. We evaluated recent video foundation models on our benchmark and found that even the best-performing model, Gemini-2.5-Pro, reached only 73% average accuracy, which is far from human performance (97%). Further analysis reveals remaining challenges in spatial relationship, motion, and part-level geometric understanding. We also found that integrating explicit HOI-related cues into visual features improves performance, offering insights for developing future models with a deeper understanding of HOI dynamics.
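To make the reported "average accuracy" concrete, the following is a minimal sketch of how multiple-choice predictions could be scored per question type. It is not the authors' evaluation code. The record fields (`type`, `answer`, `predicted`) and the toy data are hypothetical, and the abstract does not specify whether the average is taken over question types (macro) or over all QA pairs (micro), so both are shown.

```python
from collections import defaultdict

# Minimal sketch: scoring multiple-choice QA predictions.
# ASSUMPTION: the record layout and both averaging schemes are illustrative;
# the source only reports an "average accuracy" without defining it.

def accuracy_by_type(records):
    """Return {question_type: fraction of correctly answered questions}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["type"]] += 1
        if rec["predicted"] == rec["answer"]:
            correct[rec["type"]] += 1
    return {qtype: correct[qtype] / total[qtype] for qtype in total}


def macro_average(acc_by_type):
    """Unweighted mean over question types."""
    return sum(acc_by_type.values()) / len(acc_by_type)


def micro_average(records):
    """Mean over all QA pairs, regardless of question type."""
    hits = sum(rec["predicted"] == rec["answer"] for rec in records)
    return hits / len(records)


# Toy predictions; "answer" and "predicted" are option indices (hypothetical).
toy_records = [
    {"type": "Action", "answer": 0, "predicted": 0},
    {"type": "Action", "answer": 2, "predicted": 1},
    {"type": "State Change", "answer": 3, "predicted": 3},
    {"type": "Object Parts", "answer": 1, "predicted": 1},
]
per_type = accuracy_by_type(toy_records)
macro = macro_average(per_type)
micro = micro_average(toy_records)
print(per_type, macro, micro)
```

The two averages differ whenever question types have unequal sizes, which is why a per-type breakdown is usually reported alongside a single headline number.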

Problem

Research questions and friction points this paper is trying to address.

Develops a video QA benchmark for fine-grained hand-object interaction dynamics

Addresses lack of spatio-temporal reasoning in existing hand-object interaction benchmarks

Evaluates models' ability to understand manipulation styles and part-level state changes

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained video QA benchmark for hand-object interactions

Six question types covering manipulation and effect aspects

Integration of explicit HOI cues to improve model performance

🔎 Similar Papers

Do Egocentric Video-Language Models Truly Understand Hand-Object Interactions?