🤖 AI Summary
Existing behavioral understanding research largely overlooks eye gaze dynamics and their coordination with bodily actions. Method: We propose the first hierarchical framework for eye–body coordinated modeling: (1) fine-grained symbolic parsing of eye movement events; and (2) a multimodal, hierarchical semantic generation model that integrates large language models with a self-correcting iterative mechanism to achieve spatiotemporal alignment between gaze and body motion and to generate semantically coherent behavioral narratives. Contribution/Results: Our method is the first end-to-end approach enabling both symbolic eye movement parsing and joint behavioral description generation. It achieves significant performance gains on the text-driven motion generation task in the Nymeria benchmark and generalizes well to downstream tasks, including action anticipation and behavior summarization, validating its effectiveness and robustness.
📝 Abstract
Comprehensively interpreting human behavior is a core challenge in human-aware artificial intelligence. However, prior work has typically focused on body behavior, neglecting the crucial role of eye gaze and its synergy with body motion. We present GazeInterpreter, a novel large language model-based (LLM-based) approach that parses eye gaze data to generate eye-body-coordinated narrations. Specifically, our method features 1) a symbolic gaze parser that translates raw gaze signals into symbolic gaze events; 2) a hierarchical structure that first uses an LLM to generate eye gaze narration at the semantic level and then integrates gaze with body motion within the same observation window to produce an integrated narration; and 3) a self-correcting loop that iteratively refines the modality match, temporal coherence, and completeness of the integrated narration. This hierarchical and iterative processing effectively aligns physical values and semantic text in both the temporal and spatial domains. We validate the effectiveness of our eye-body-coordinated narrations on the text-driven motion generation task in the large-scale Nymeria benchmark. Moreover, we report significant performance improvements on the downstream tasks of action anticipation and behavior summarization. Taken together, these results reveal the significant potential of parsing eye gaze for interpreting human behavior and open up a new direction for human behavior understanding.
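The abstract's first stage, symbolic gaze parsing, can be illustrated with a minimal sketch. The paper does not specify its parser; the toy code below uses a standard velocity-threshold scheme (I-VT-style) on 1-D gaze angles purely for illustration, and all function names (`parse_gaze_events`, `merge_events`) are hypothetical, not part of GazeInterpreter:

```python
# Hypothetical sketch of a symbolic gaze parser (stage 1 in the abstract).
# Assumption: a velocity-threshold (I-VT-style) event classifier stands in
# for the paper's unspecified parser; gaze is simplified to scalar angles.

def parse_gaze_events(angles_deg, dt=1 / 60, velocity_threshold=30.0):
    """Label each inter-sample interval as 'fixation' or 'saccade'.

    angles_deg: sequence of gaze angles in degrees (toy 1-D case).
    dt: sampling interval in seconds.
    velocity_threshold: deg/s; velocities at or above it count as saccades.
    """
    events = []
    for prev, cur in zip(angles_deg, angles_deg[1:]):
        velocity = abs(cur - prev) / dt  # angular velocity in deg/s
        events.append("saccade" if velocity >= velocity_threshold else "fixation")
    return events


def merge_events(events):
    """Collapse consecutive identical labels into (label, run_length) pairs,
    yielding the symbolic event sequence an LLM could narrate."""
    runs = []
    for label in events:
        if runs and runs[-1][0] == label:
            runs[-1] = (label, runs[-1][1] + 1)
        else:
            runs.append((label, 1))
    return runs


# Example: two slow samples, a fast jump, then slow again.
events = parse_gaze_events([0.0, 0.1, 0.2, 5.0, 10.0, 10.1])
runs = merge_events(events)  # e.g. fixation run, saccade run, fixation run
```

In the full system, such symbolic runs would be the input to the hierarchical LLM narration and self-correcting refinement stages described above.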