🤖 AI Summary
This work addresses first-person (egocentric) interaction understanding in human-agent collaboration by formulating a three-task forecasting problem: jointly predicting a person's intent to interact with the agent, their attitude towards the agent, and the action they will perform. Methodologically, we introduce SocialEgoNet, a hierarchical multitask graph-based network designed to exploit the dependencies among the three tasks. The model builds a spatiotemporal graph from whole-body skeleton keypoints (face, hands, and body) extracted from just one second of video input, which keeps inference fast enough for real-time prediction. Evaluated on JPL-Social, an existing egocentric human-agent interaction dataset that we augment with new class labels and bounding box annotations, the framework achieves an average accuracy of 83.15% across all tasks, outperforming several competitive baselines. The goal is to let an agent recognize its target user proactively and prepare for upcoming interactions.
📝 Abstract
For efficient human-agent interaction, an agent should proactively recognize its target user and prepare for upcoming interactions. We formulate this challenging problem as the novel task of jointly forecasting a person's intent to interact with the agent, their attitude towards the agent, and the action they will perform, all from the agent's (egocentric) perspective. To this end, we propose SocialEgoNet, a graph-based spatiotemporal framework that exploits task dependencies through a hierarchical multitask learning approach. SocialEgoNet uses whole-body skeletons (keypoints from the face, hands, and body) extracted from only 1 second of video input for high inference speed. For evaluation, we augment an existing egocentric human-agent interaction dataset with new class labels and bounding box annotations. Extensive experiments on this augmented dataset, named JPL-Social, demonstrate real-time inference and show that our model outperforms several competitive baselines, with an average accuracy of 83.15% across all tasks. The additional annotations and code will be available upon acceptance.
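To make the idea of hierarchical multitask prediction concrete, the following is a minimal sketch, not the authors' implementation. It assumes one plausible dependency order (intent feeds attitude, and both feed action), which the abstract does not spell out. The class counts, feature size, random weights, and all names (`HierarchicalHeads`, `make_head`, `linear`) are hypothetical, and the spatiotemporal graph encoder that would produce the shared feature vector is replaced by a fixed placeholder vector.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def make_head(rng, in_dim, out_dim):
    """Randomly initialised (untrained) head, for illustration only."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


class HierarchicalHeads:
    """Three task heads where later tasks see earlier tasks' predictions.

    Assumed order (hypothetical): intent -> attitude -> action.
    """

    def __init__(self, feat_dim, n_intent, n_attitude, n_action, seed=0):
        rng = random.Random(seed)
        self.intent = make_head(rng, feat_dim, n_intent)
        self.attitude = make_head(rng, feat_dim + n_intent, n_attitude)
        self.action = make_head(
            rng, feat_dim + n_intent + n_attitude, n_action)

    def predict(self, feature):
        # Task 1: intent from the shared feature alone.
        p_intent = softmax(linear(*self.intent, feature))
        # Task 2: attitude from the feature plus intent probabilities.
        p_attitude = softmax(linear(*self.attitude, feature + p_intent))
        # Task 3: action from the feature plus both earlier predictions.
        p_action = softmax(
            linear(*self.action, feature + p_intent + p_attitude))
        return p_intent, p_attitude, p_action


# Placeholder for the encoder output on a 1-second skeleton clip.
feature = [0.5] * 16
heads = HierarchicalHeads(feat_dim=16, n_intent=2, n_attitude=3, n_action=8)
p_intent, p_attitude, p_action = heads.predict(feature)
print(len(p_intent), len(p_attitude), len(p_action))
```

Concatenating the earlier tasks' probabilities onto the shared feature is only one simple way to pass information down the hierarchy; a trained model could equally share intermediate representations instead.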