Gaze2Act: Gaze-Conditioned Vision-Language-Action Policies for Interactive Robot Manipulation

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the difficulty of conveying precise human intent through language instructions alone in robotic manipulation, particularly when the target object is ambiguous among similar candidates, the location to act on is hard to describe, or the target changes during execution. The authors propose Gaze2Act, a Vision-Language-Action (VLA) framework that uses human gaze as a dynamic intent signal. Gaze2Act applies cross-view semantic matching to map first-person gaze into the robot's perspective, producing an object mask and a gaze point for coarse-to-fine target specification. These cues enter the policy through perception-level prompting and action-level conditioning, so the robot can attend to the relevant region and act precisely as intent changes. Evaluated on a Unitree G1 humanoid across 16 real-robot tasks in seven categories, Gaze2Act is reported to achieve state-of-the-art intent accuracy and task success rate, outperforming baselines most notably in object disambiguation, fine-grained interaction, and dynamic intent steering.

📝 Abstract

Vision-Language-Action (VLA) models have recently shown strong potential for robot learning by following language instructions. However, in practice, language alone is often insufficient to precisely convey human intent. It is difficult to describe which exact object to interact with among similar candidates, where to act on the object, or how the target may change during execution. To address this limitation, we propose Gaze2Act, a novel VLA framework that leverages human gaze as a dynamic and intuitive intent signal for complex interactive manipulation. Gaze2Act first bridges the ego-exo view gap by mapping first-person gaze into the robot's perspective through cross-view semantic matching, producing both an object mask and a gaze point for coarse-to-fine target specification. These cues are then integrated into the policy through perception-level prompting and action-level conditioning, allowing the robot to attend to relevant regions and execute precise interactions under dynamic intent. In a systematic evaluation across seven task categories and 16 real-robot tasks on a Unitree G1 humanoid, Gaze2Act achieves state-of-the-art performance in both intent accuracy and task success rate. It notably outperforms baselines in object disambiguation, fine-grained interaction, and dynamic intent steering. These results demonstrate that human gaze provides a natural, low-burden, and highly expressive modality for human-in-the-loop VLA control.
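
To make the cross-view step more concrete, here is a minimal sketch of how a gaze location in the human's first-person view could be matched into the robot's view to yield a gaze point and a coarse object mask. The abstract does not specify the matching method, so this is an assumption: it uses nearest-neighbor cosine similarity over per-patch feature vectors as a stand-in. The function names (`cosine`, `match_gaze`), the `mask_thresh` parameter, and the toy feature grids are all hypothetical and not taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def match_gaze(ego_feats, gaze_rc, robot_feats, mask_thresh=0.9):
    """Map a gaze cell in the ego-view feature grid into the robot-view grid.

    Assumption: both grids hold per-patch semantic features from the same
    (hypothetical) encoder, so similar objects give similar vectors.
    Returns (gaze_point, mask): the best-matching robot-view cell and a
    boolean grid of cells whose similarity reaches mask_thresh.
    """
    r, c = gaze_rc
    query = ego_feats[r][c]  # feature under the human's gaze
    sim = [[cosine(query, f) for f in row] for row in robot_feats]
    cells = [(i, j) for i in range(len(sim)) for j in range(len(sim[0]))]
    gaze_point = max(cells, key=lambda rc: sim[rc[0]][rc[1]])  # fine cue
    mask = [[s >= mask_thresh for s in row] for row in sim]    # coarse cue
    return gaze_point, mask

# Toy 2-D features: the gazed-at patch in the ego view looks like [1, 0].
ego = [[[1, 0], [0, 1]],
       [[1, 1], [0, 1]]]
robot = [[[0, 1], [0, 1], [1, 0.1]],
         [[0, 1], [0.9, 0.2], [1, 0]]]

point, mask = match_gaze(ego, (0, 0), robot)
print(point)
print(mask)
```

In this toy setup the mask plays the role of the coarse target specification and the single best cell plays the role of the fine gaze point; a real system would work on dense image features and would need to handle cases where no robot-view region matches well.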

Problem

Research questions and friction points this paper is trying to address.

human intent

language ambiguity

object disambiguation

interactive manipulation

gaze guidance

Innovation

Methods, ideas, or system contributions that make the work stand out.

gaze-conditioned policy

vision-language-action

cross-view semantic matching