Interact with me: Joint Egocentric Forecasting of Intent to Interact, Attitude and Social Actions

📅 2024-12-21
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses first-person interaction understanding in human-agent collaboration by proposing a three-task learning framework that jointly predicts a user's interaction intent, social attitude, and fine-grained social action. Methodologically, the authors introduce SocialEgoNet, a hierarchical multi-task graph neural network designed to explicitly model the semantic dependencies among the three tasks. It builds a spatiotemporal graph from whole-body skeletal keypoints (face, hands, and body) and pairs lightweight feature extraction with an end-to-end architecture that enables real-time prediction from just one second of video input. Evaluated on the newly curated JPL-Social dataset, the framework achieves a mean accuracy of 83.15% across tasks, outperforming competitive state-of-the-art baselines. This advancement supports agent proactivity and response efficiency in interactive scenarios.
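The hierarchical multi-task design described above can be illustrated with a minimal NumPy sketch. Everything here is a stand-in under assumed dimensions (30 fps, 133 whole-body keypoints, 2-D coordinates, three attitude classes, eight action classes are illustrative choices, not the paper's specification), and the random projection replaces the actual graph encoder; the point is only how the attitude head can condition on the intent output, and the action head on both.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions: 1 s of video at 30 fps, 133 whole-body
# keypoints (face + hands + body), 2-D coordinates per keypoint.
T, K, C = 30, 133, 2
D = 64                         # shared embedding size (illustrative)
skeleton = rng.standard_normal((T, K, C))

# Stand-in for the graph encoder: flatten each frame's keypoints,
# project with a fixed random matrix, and mean-pool over time.
W_enc = rng.standard_normal((K * C, D)) / np.sqrt(K * C)
frame_emb = np.tanh(skeleton.reshape(T, K * C) @ W_enc)
z = frame_emb.mean(axis=0)     # (D,) clip-level embedding

def head(x, n_out, seed):
    """Toy classification head: random linear layer + softmax."""
    w = np.random.default_rng(seed).standard_normal((x.shape[0], n_out))
    logits = x @ w
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Hierarchical heads: the attitude head conditions on the intent
# probabilities, and the action head on both -- a toy rendering of
# the task dependencies the summary describes.
p_intent = head(z, 2, seed=1)                                # interact or not
p_attitude = head(np.concatenate([z, p_intent]), 3, seed=2)  # e.g. 3 attitude classes
p_action = head(np.concatenate([z, p_intent, p_attitude]), 8, seed=3)

print(p_intent.shape, p_attitude.shape, p_action.shape)
```

In a trained model the random matrices would be learned jointly, so gradients from the action and attitude losses also shape the shared embedding, which is the usual motivation for this kind of hierarchical multi-task head.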

📝 Abstract
For efficient human-agent interaction, an agent should proactively recognize their target user and prepare for upcoming interactions. We formulate this challenging problem as the novel task of jointly forecasting a person's intent to interact with the agent, their attitude towards the agent and the action they will perform, from the agent's (egocentric) perspective. So we propose *SocialEgoNet* - a graph-based spatiotemporal framework that exploits task dependencies through a hierarchical multitask learning approach. SocialEgoNet uses whole-body skeletons (keypoints from face, hands and body) extracted from only 1 second of video input for high inference speed. For evaluation, we augment an existing egocentric human-agent interaction dataset with new class labels and bounding box annotations. Extensive experiments on this augmented dataset, named JPL-Social, demonstrate *real-time* inference and superior performance (average accuracy across all tasks: 83.15%) of our model outperforming several competitive baselines. The additional annotations and code will be available upon acceptance.
Problem

Research questions and friction points this paper is trying to address.

Forecasting human intent to interact with agents
Predicting human attitude and social actions
Real-time egocentric human-agent interaction analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph-based spatiotemporal framework for multitask learning
Uses whole-body skeletons for real-time inference
Augmented dataset with new labels and annotations
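The first innovation bullet rests on a spatiotemporal skeleton graph. A minimal sketch of how such a graph's adjacency is typically assembled (ST-GCN style: spatial edges along bones within a frame, temporal edges linking the same joint across consecutive frames) is below; the five-joint skeleton and bone list are toy assumptions, not the paper's keypoint layout.

```python
import numpy as np

# Toy spatiotemporal skeleton graph: K joints per frame over T frames.
T, K = 3, 5
bones = [(0, 1), (1, 2), (1, 3), (3, 4)]   # assumed toy bone list

N = T * K                                   # one node per joint per frame
A = np.zeros((N, N), dtype=int)

def node(t, k):
    """Index of joint k at frame t in the flattened node ordering."""
    return t * K + k

for t in range(T):                          # spatial edges within a frame
    for i, j in bones:
        A[node(t, i), node(t, j)] = A[node(t, j), node(t, i)] = 1

for t in range(T - 1):                      # temporal edges across frames
    for k in range(K):
        A[node(t, k), node(t + 1, k)] = A[node(t + 1, k), node(t, k)] = 1

print(A.shape, int(A.sum()))
```

A graph network then propagates keypoint features over this symmetric adjacency, so each joint aggregates information from its skeletal neighbours and from its own trajectory over time.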