🤖 AI Summary
This paper addresses keypoint-based person identification in natural face-to-face conversational scenarios. We propose a dual-stream Transformer framework: a spatial stream models static body keypoint configurations, while a temporal stream employs a multi-scale temporal Transformer to hierarchically model motion dynamics using velocity features. Using COCO WholeBody keypoints, we compare pre-trained models with training from scratch and perform feature-level fusion of the two streams. Experiments show that the spatial stream achieves 95.74% accuracy, the temporal stream attains 93.90%, and their fusion yields 98.03%, demonstrating strong complementarity between static pose structure and dynamic motion patterns. Our main contributions are: (1) the first dual-stream pose recognition architecture tailored for conversational settings; (2) a multi-scale temporal Transformer enabling fine-grained action modeling; and (3) systematic validation of the synergistic performance gain from joint spatial–temporal representation learning.
📝 Abstract
This paper investigates the performance of transformer-based architectures for person identification in natural, face-to-face conversational scenarios. We implement and evaluate a two-stream framework that separately models the spatial configurations and temporal motion patterns of 133 COCO WholeBody keypoints, extracted from a subset of the CANDOR conversational corpus. Our experiments compare pre-trained models with training from scratch, investigate the use of velocity features, and introduce a multi-scale temporal transformer for hierarchical motion modeling. Results demonstrate that domain-specific training significantly outperforms transfer learning, and that spatial configurations carry more discriminative information than temporal dynamics. The spatial transformer achieves 95.74% accuracy, while the multi-scale temporal transformer achieves 93.90%. Feature-level fusion raises accuracy to 98.03%, confirming that postural and dynamic information are complementary. These findings highlight the potential of transformer architectures for person identification in natural interactions and provide insights for future multimodal and cross-cultural studies.
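As a rough illustration of two ideas named in the abstract, the sketch below computes velocity features as frame-to-frame keypoint displacements and fuses two stream embeddings at the feature level. The abstract does not specify either operation, so both the first-order difference and the concatenation used here are assumptions, and the names `velocity_features` and `fuse_features` are hypothetical.

```python
from typing import List

# One frame holds K keypoints, each an [x, y] pair (K = 133 for COCO WholeBody).
Frame = List[List[float]]


def velocity_features(seq: List[Frame]) -> List[Frame]:
    """Return per-keypoint displacement between consecutive frames.

    Assumption: velocity is a first-order temporal difference, so a
    sequence of T frames yields T - 1 velocity frames.
    """
    return [
        [[b[0] - a[0], b[1] - a[1]] for a, b in zip(prev, curr)]
        for prev, curr in zip(seq, seq[1:])
    ]


def fuse_features(spatial_emb: List[float], temporal_emb: List[float]) -> List[float]:
    """Fuse two stream embeddings by concatenation.

    Assumption: concatenation is one common feature-level fusion operator;
    the operator used in the paper is not stated in the abstract.
    """
    return list(spatial_emb) + list(temporal_emb)


# Toy input: 3 frames, 2 keypoints each.
toy_seq = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[1.0, 2.0], [1.0, 1.0]],
    [[3.0, 2.0], [0.0, 1.0]],
]
toy_velocity = velocity_features(toy_seq)
toy_fused = fuse_features([0.5, 0.25], [1.0, 2.0, 3.0])
```

In a full pipeline the fused vector would typically feed a classification head over person identities. A stationary keypoint produces a zero velocity vector, which is one intuition for why the temporal stream can carry information that the spatial stream does not.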