FingerCap: Fine-grained Finger-level Hand Motion Captioning

📅 2025-11-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the problem of generating fine-grained, finger-level textual descriptions of hand gestures. We propose FiGOP, which adapts the classic Group-of-Pictures (GOP) concept from video coding to finger-motion modeling: it jointly encodes RGB keyframes and hand keypoint sequences via a lightweight temporal encoder to recover high-frequency dynamic details. To support this task, we introduce FingerCap-40K—the first large-scale finger-action–text dataset, comprising 40K diverse samples that cover both instructional gestures and natural hand–object interactions. We further propose HandJudge, an LLM-based evaluation framework that enables quantitative assessment of finger-level description accuracy and action completeness. Extensive experiments demonstrate that FiGOP significantly improves the finger-gesture understanding of existing multimodal video foundation models, achieving consistent gains in both automated metrics and human evaluations.

📝 Abstract
Understanding fine-grained human hand motion is fundamental to visual perception, embodied intelligence, and multimodal communication. In this work, we propose Fine-grained Finger-level Hand Motion Captioning (FingerCap), which aims to generate textual descriptions that capture detailed finger-level semantics of hand actions. To support this task, we curate FingerCap-40K, a large-scale corpus of 40K paired hand-motion videos and captions spanning two complementary sources: concise instruction-style finger motions and diverse, naturalistic hand-object interactions. To enable effective evaluation, we employ HandJudge, an LLM-based rubric that measures finger-level correctness and motion completeness. Temporal sparsity remains a fundamental bottleneck for current Video-MLLMs, since sparse RGB sampling is insufficient to capture the subtle, high-frequency dynamics underlying fine finger motions. As a simple and compute-friendly remedy, we introduce FiGOP (Finger Group-of-Pictures), which pairs each RGB keyframe with the subsequent hand keypoints until the next keyframe. A lightweight temporal encoder converts the keypoints into motion embeddings and integrates them with RGB features. FiGOP adapts the classic GOP concept to finger motion, recovering fine temporal cues without increasing RGB density. Experiments on FingerCap-40K show that strong open- and closed-source Video-MLLMs still struggle with finger-level reasoning, while our FiGOP-augmented model yields consistent gains under HandJudge and human studies.
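The HandJudge rubric described above can be pictured as a prompt-and-parse loop around an LLM judge. The sketch below is illustrative only: the two scoring axes (finger-level correctness, motion completeness) come from the abstract, but the prompt wording, 0–5 scale, and JSON schema are assumptions, and the actual LLM call is left out.

```python
import json

# Rubric axes follow the paper's stated criteria; the prompt text,
# score range, and JSON reply format are hypothetical.
RUBRIC = (
    "Grade the candidate caption against the reference hand-motion caption.\n"
    "Score each axis from 0 to 5 and reply with JSON only:\n"
    '{"finger_correctness": <int>, "motion_completeness": <int>}'
)

def build_prompt(reference: str, candidate: str) -> str:
    """Assemble the judge prompt for one reference/candidate caption pair."""
    return f"{RUBRIC}\nReference: {reference}\nCandidate: {candidate}"

def parse_scores(llm_reply: str) -> dict:
    """Parse the judge's JSON reply, clamping each score into 0..5."""
    raw = json.loads(llm_reply)
    return {axis: max(0, min(5, int(raw[axis])))
            for axis in ("finger_correctness", "motion_completeness")}
```

In practice `build_prompt` would be sent to an LLM and its reply fed to `parse_scores`; clamping guards against out-of-range scores from the judge.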
Problem

Research questions and friction points this paper is trying to address.

Generating detailed finger-level textual descriptions of hand motions
Addressing temporal sparsity in capturing subtle finger motion dynamics
Creating evaluation methods for finger-level motion captioning accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

FiGOP pairs each RGB keyframe with subsequent hand keypoints to model motion
Lightweight temporal encoder converts keypoints to motion embeddings
FiGOP recovers fine temporal cues without increasing RGB density
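The grouping idea behind FiGOP can be sketched in a few lines: each RGB keyframe anchors a group that carries the hand-keypoint frames up to the next keyframe, and a temporal encoder turns those keypoints into a motion embedding fused with the RGB feature. This is a toy numpy illustration, not the paper's implementation; the mean-delta "encoder", concatenation fusion, and 21-point 2D hand layout are assumptions.

```python
import numpy as np

def group_figop(num_frames: int, keyframe_stride: int):
    """Split a clip into FiGOP groups: one RGB keyframe per group, plus
    the keypoint frames up to (but excluding) the next keyframe."""
    groups = []
    for start in range(0, num_frames, keyframe_stride):
        end = min(start + keyframe_stride, num_frames)
        groups.append({"keyframe": start,
                       "keypoint_frames": list(range(start + 1, end))})
    return groups

def encode_group(rgb_feat: np.ndarray, keypoints: np.ndarray) -> np.ndarray:
    """Toy stand-in for the lightweight temporal encoder: embed the mean
    frame-to-frame keypoint delta, then fuse it with the keyframe's RGB
    feature by concatenation (the paper's fusion details are not public)."""
    T = keypoints.shape[0]
    flat = keypoints.reshape(T, -1)                    # (T, 21*2) per frame
    if T >= 2:
        motion = np.diff(flat, axis=0).mean(axis=0)   # (42,) mean velocity
    else:
        motion = np.zeros(flat.shape[1])
    return np.concatenate([rgb_feat, motion])         # fused token, (D + 42,)

groups = group_figop(num_frames=10, keyframe_stride=4)
fused = encode_group(np.zeros(64), np.zeros((4, 21, 2)))
```

The point of the sketch is that temporal detail comes from cheap keypoint sequences, so RGB sampling density (here, one keyframe every 4 frames) never increases.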