GesVLA: Gesture-Aware Vision-Language-Action Model Embedded Representations

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses spatial ambiguity among multiple visually similar objects in complex scenes, which existing vision–language–action (VLA) models struggle to resolve because they rely primarily on textual instructions. The study introduces gesture as a parallel instruction modality integrated into the VLA framework: gesture features are embedded directly into the latent space, and a dual vision–language model (dual-VLM) architecture couples high-level spatial reasoning with low-level action generation. A scalable data generation pipeline renders hand models onto real-world scene images, and a two-stage training strategy equips the model with both gesture perception and action prediction. Experiments on real-world robotic tasks (block manipulation, product selection, and produce selection) show that adding gesture consistently improves target grounding accuracy and human–robot interaction efficiency, especially in complex, cluttered environments.
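To make the idea of embedding gesture features into the latent space concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the linear projection, the single appended gesture token, and every name (`linear_project`, `fuse_gesture_token`, `d_model`, `d_gesture`) are assumptions chosen for illustration, and the weights are random rather than learned.

```python
import random


def linear_project(vec, weight, bias):
    """Project a feature vector with y = W x + b (W given as a list of rows)."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def fuse_gesture_token(token_embeddings, gesture_feature, weight, bias):
    """Append one projected gesture token to a sequence of embeddings.

    Hypothetical fusion scheme: the gesture feature is mapped to the same
    width as the existing tokens and added as one extra sequence element.
    """
    gesture_token = linear_project(gesture_feature, weight, bias)
    d_model = len(token_embeddings[0])
    if len(gesture_token) != d_model:
        raise ValueError("projected gesture token must match d_model")
    return token_embeddings + [gesture_token]


# Toy dimensions and random weights, purely illustrative.
rng = random.Random(0)
d_model, d_gesture = 4, 3
weight = [[rng.uniform(-1, 1) for _ in range(d_gesture)] for _ in range(d_model)]
bias = [0.0] * d_model

tokens = [[0.1] * d_model for _ in range(5)]   # stand-in vision/text tokens
gesture = [0.5, -0.2, 0.9]                     # stand-in gesture feature

fused = fuse_gesture_token(tokens, gesture, weight, bias)
print(len(fused), len(fused[-1]))  # 6 4
```

Because the gesture token lives in the same sequence as the vision and text tokens, any downstream module that attends over the sequence can condition on it, which is one plausible reading of how gesture could influence both reasoning and action generation.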

📝 Abstract

Vision-Language-Action (VLA) models have shown strong potential for general-purpose robot manipulation by unifying perception and action. However, existing VLA systems primarily rely on textual instructions and struggle to resolve spatial ambiguity in complex scenes with multiple similar objects. To address this limitation, we introduce gesture as a parallel instruction modality and propose a Gesture-aware Vision-Language-Action model (GesVLA). Our approach encodes gesture features directly into the latent space, enabling them to participate in both high-level reasoning and low-level action generation, and adopts a dual-VLM architecture to achieve tight coupling between gesture representations and action policies. At the data level, we construct a scalable gesture data generation pipeline by rendering hand models onto real-world scene images. This reduces the sim-to-real visual gap while producing rich data with diverse motion patterns and corresponding pointing annotations. In addition, we employ a two-stage training strategy to equip the model with both gesture perception and action prediction capabilities. We evaluate our approach on multiple real-world robotic tasks, including a controlled block manipulation task for validation and more practical scenarios such as product and produce selection. Experimental results show that incorporating gesture consistently improves target grounding accuracy and human-robot interaction efficiency, especially in complex and cluttered environments. Project page: https://gwxuan.github.io/GesVLA/.
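The data generation step described in the abstract, rendering hand models onto real-world scene images, can be illustrated with a small alpha-compositing sketch. This is an assumption-laden toy, not the paper's pipeline: images are nested lists of pixel tuples, the hand render is assumed to arrive as RGBA, and the function names and annotation fields (`composite_hand`, `make_sample`, `fingertip_xy`, `target_xy`) are hypothetical.

```python
def composite_hand(scene, hand_rgba, top, left):
    """Alpha-blend an RGBA hand render onto an RGB scene; returns a new image."""
    out = [[list(px) for px in row] for row in scene]
    for i, row in enumerate(hand_rgba):
        for j, (r, g, b, a) in enumerate(row):
            y, x = top + i, left + j
            # Skip hand pixels that fall outside the scene.
            if 0 <= y < len(out) and 0 <= x < len(out[0]):
                alpha = a / 255.0
                background = out[y][x]
                out[y][x] = [round(alpha * fg + (1 - alpha) * bg)
                             for fg, bg in zip((r, g, b), background)]
    return out


def make_sample(scene, hand_rgba, top, left, fingertip_in_hand, target_xy):
    """Build one training sample: composited image plus pointing annotation."""
    image = composite_hand(scene, hand_rgba, top, left)
    fy, fx = fingertip_in_hand  # fingertip position in hand-render coordinates
    return {
        "image": image,
        "fingertip_xy": (left + fx, top + fy),  # fingertip in scene coordinates
        "target_xy": target_xy,                 # pixel the gesture points at
    }


# Toy 4x4 gray scene and a 2x2 hand patch (opaque, half-transparent, transparent).
scene = [[(100, 100, 100)] * 4 for _ in range(4)]
hand = [
    [(200, 150, 120, 255), (0, 0, 0, 0)],
    [(200, 150, 120, 128), (0, 0, 0, 0)],
]
sample = make_sample(scene, hand, top=1, left=1,
                     fingertip_in_hand=(0, 0), target_xy=(3, 0))
print(sample["image"][1][1], sample["fingertip_xy"])  # [200, 150, 120] (1, 1)
```

Since the background is a real photograph and only the hand is synthetic, a scheme like this keeps scene appearance realistic while the annotation comes for free from the known placement of the render.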

Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action

spatial ambiguity

gesture

robot manipulation

target grounding

Innovation

Methods, ideas, or system contributions that make the work stand out.

gesture-aware

Vision-Language-Action

dual-VLM architecture