🤖 AI Summary
Addressing the challenge of jointly modeling long-term verbal and nonverbal signals across individuals—such as posture, speech, and co-dining behaviors—in multi-person social interactions, this paper proposes a person-aware causal Transformer framework. The method introduces blockwise attention masking over both the modality and temporal dimensions, combined with person identity encoding and multimodal embedding fusion, enabling unified temporal modeling of multi-participant, multimodal signals and effective capture of long-range social dynamics. Evaluated on the HHCD dataset, the approach achieves significant improvements in predicting bite timing and speaking status, demonstrating that joint multimodal modeling outperforms unimodal or decoupled baselines. The source code is publicly available.
📝 Abstract
Understanding social signals in multi-party conversations is important for human-robot interaction and artificial social intelligence. Multi-party interactions include social signals like body pose, head pose, speech, and context-specific activities like acquiring and taking bites of food when dining. Incorporating all the multimodal signals in a multi-party interaction is difficult, and past work tends to build task-specific models for predicting social signals. In this work, we address the challenge of predicting multimodal social signals in multi-party settings in a single model. We introduce M3PT, a causal transformer architecture with modality and temporal blockwise attention masking which allows for the simultaneous processing of multiple social cues across multiple participants and their temporal interactions. This approach better captures social dynamics over time by considering longer horizons of social signals between individuals. We train and evaluate our unified model on the Human-Human Commensality Dataset (HHCD), and demonstrate that using multiple modalities improves bite timing and speaking status prediction. Source code: https://github.com/AbrarAnwar/masked-social-signals/
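To make the masking idea concrete, here is a minimal sketch of a temporally causal blockwise attention mask of the kind the abstract describes. This is an illustrative assumption, not the paper's implementation: it assumes tokens are flattened in timestep-major order over participants and modalities, so that all tokens within a timestep form one attention block and each token may attend only to tokens at the same or earlier timesteps.

```python
import numpy as np

# Hypothetical dimensions: T timesteps, P participants, M modalities.
T, P, M = 4, 3, 2
N = T * P * M  # total number of tokens in the flattened sequence

# Timestep index of each token, assuming timestep-major flattening:
# all P * M tokens of timestep t appear before those of timestep t + 1.
time_idx = np.repeat(np.arange(T), P * M)

# Blockwise causal mask: token i may attend to token j iff j's timestep
# is not later than i's. Within a timestep, every participant/modality
# token can attend to every other, forming (P*M x P*M) blocks.
mask = time_idx[:, None] >= time_idx[None, :]  # shape (N, N), True = allowed
```

The resulting boolean matrix has a block lower-triangular structure and would typically be converted to additive `-inf` biases before being passed to a standard attention layer.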