Gestura: A LVLM-Powered System Bridging Motion and Semantics for Real-Time Free-Form Gesture Understanding

📅 2025-10-21
🤖 AI Summary
Existing free-form gesture understanding methods (e.g., GestureGPT) suffer from low recognition accuracy, high response latency, and poor generalization to ambiguous or unconventional gestures. To address these limitations, this paper proposes Gestura, an end-to-end framework for real-time gesture semantic understanding. First, it introduces an anatomy-informed hand keypoint processing module to make the motion representation more robust. Second, it integrates a large vision-language model (LVLM) with chain-of-thought (CoT) reasoning to enable an interpretable, hierarchical mapping from dynamic gestures to high-level semantic intentions. Experiments demonstrate significant improvements over baseline methods in both accuracy and inference latency. The authors also construct and publicly release the first large-scale dataset for free-form gesture intention understanding, comprising over 300,000 annotated question-answer pairs, which establishes a new benchmark for the community.

📝 Abstract
Free-form gesture understanding is highly appealing for human-computer interaction, as it liberates users from the constraints of predefined gesture categories. However, the sole existing solution, GestureGPT, suffers from limited recognition accuracy and slow response times. In this paper, we propose Gestura, an end-to-end system for free-form gesture understanding. Gestura harnesses a pre-trained Large Vision-Language Model (LVLM) to align the highly dynamic and diverse patterns of free-form gestures with high-level semantic concepts. To better capture subtle hand movements across different styles, we introduce a Landmark Processing Module that compensates for LVLMs' lack of fine-grained domain knowledge by embedding anatomical hand priors. Further, a Chain-of-Thought (CoT) reasoning strategy enables step-by-step semantic inference, transforming shallow knowledge into deep semantic understanding and significantly enhancing the model's ability to interpret ambiguous or unconventional gestures. Together, these components allow Gestura to achieve robust and adaptable free-form gesture comprehension. Additionally, we have developed the first open-source dataset for free-form gesture intention reasoning and understanding, with over 300,000 annotated QA pairs.
Problem

Research questions and friction points this paper is trying to address.

Enhancing real-time free-form gesture recognition accuracy
Bridging dynamic gesture patterns with semantic concepts
Overcoming slow response times in gesture interpretation systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

LVLM aligns free-form gestures with semantic concepts
Landmark module embeds anatomical hand priors
Chain-of-Thought enables step-by-step semantic inference
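
To give a rough sense of what "anatomical hand priors" could look like in practice, the sketch below turns raw hand landmarks into per-finger bend angles. This is not the paper's Landmark Processing Module, whose details are not given here. The 21-point landmark layout (wrist at index 0, four points per finger), the `FINGERS` table, and the `finger_flexion` helper are all assumptions made for illustration.

```python
import math

# Assumed 21-point hand layout (wrist = 0, then four points per finger),
# as used by common hand trackers. The paper's actual layout is not stated.
FINGERS = {
    "thumb": (1, 2, 3, 4),
    "index": (5, 6, 7, 8),
    "middle": (9, 10, 11, 12),
    "ring": (13, 14, 15, 16),
    "pinky": (17, 18, 19, 20),
}


def _sub(a, b):
    """Component-wise difference a - b of two 3D points."""
    return tuple(x - y for x, y in zip(a, b))


def _angle(u, v):
    """Angle in degrees between vectors u and v (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    if norm == 0:
        return 0.0
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cos))


def finger_flexion(landmarks):
    """Hypothetical helper: bend angle between consecutive bone segments.

    `landmarks` is a sequence of 21 (x, y, z) tuples. For each finger the
    chain wrist -> joint -> joint -> joint -> tip yields four bone vectors
    and therefore three bend angles (0 degrees means a straight finger).
    """
    features = {}
    for name, chain in FINGERS.items():
        points = [landmarks[0]] + [landmarks[i] for i in chain]
        bones = [_sub(points[i + 1], points[i]) for i in range(len(points) - 1)]
        features[name] = [
            _angle(bones[i], bones[i + 1]) for i in range(len(bones) - 1)
        ]
    return features
```

Angles of this kind do not change when the hand is translated or uniformly scaled, which is one plausible reason a system might prefer them to raw coordinates when describing motion to a language model.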
Authors

Zhuoming Li
Institute of Artificial Intelligence (TeleAI) of China Telecom, China
Aitong Liu
Institute of Artificial Intelligence (TeleAI) of China Telecom, China
Mengxi Jia
Peking University; Institute of Artificial Intelligence (TeleAI), China Telecom (中国电信人工智能研究院)
Machine Learning, Deep Representation Learning, LVLM
Tengxiang Zhang
Goertek Inc, China
Dell Zhang
Institute of Artificial Intelligence (TeleAI), China Telecom
Machine Learning, Information Retrieval, Natural Language Processing
Xuelong Li
Institute of Artificial Intelligence (TeleAI) of China Telecom, China