Beyond Appearance: Transformer-based Person Identification from Conversational Dynamics

📅 2025-10-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses keypoint-based person identification in natural face-to-face conversational scenarios. We propose a dual-stream Transformer framework: a spatial stream models static body keypoint configurations, while a temporal stream employs a multi-scale temporal Transformer to hierarchically model motion dynamics using velocity features. Leveraging COCO WholeBody keypoints, we compare pretraining versus from-scratch training and perform feature-level fusion of the two streams. Experiments show that the spatial configuration achieves 95.74% accuracy, the temporal modeling attains 93.90%, and their fusion yields 98.03%—demonstrating strong complementarity between static pose structure and dynamic motion patterns. Our main contributions are: (1) the first dual-stream pose recognition architecture tailored for conversational settings; (2) a multi-scale temporal Transformer enabling fine-grained action modeling; and (3) systematic validation of the synergistic performance gain from joint spatial–temporal representation learning.
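The temporal stream is described as operating on velocity features derived from the keypoint sequences. As a rough illustration (not the authors' code; the `(T, 133, 2)` layout of COCO WholeBody x/y coordinates per frame is an assumption), velocities can be computed as frame-to-frame displacements:

```python
import numpy as np

def velocity_features(keypoints):
    """Frame-to-frame displacement of body keypoints.

    keypoints: array of shape (T, 133, 2) -- T frames of COCO WholeBody
    (x, y) coordinates per frame (a hypothetical layout).
    Returns an array of shape (T-1, 133, 2).
    """
    return np.diff(keypoints, axis=0)

# Toy example: 4 frames; every keypoint jumps by 1 after the first frame.
seq = np.zeros((4, 133, 2))
seq[1:] = 1.0
vel = velocity_features(seq)
print(vel.shape)  # (3, 133, 2)
```

Only the first velocity step is nonzero here, since the pose is static after the initial jump.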

📝 Abstract
This paper investigates the performance of transformer-based architectures for person identification in natural, face-to-face conversation scenarios. We implement and evaluate a two-stream framework that separately models spatial configurations and temporal motion patterns of 133 COCO WholeBody keypoints, extracted from a subset of the CANDOR conversational corpus. Our experiments compare pretrained and from-scratch training, investigate the use of velocity features, and introduce a multi-scale temporal transformer for hierarchical motion modeling. Results demonstrate that domain-specific training significantly outperforms transfer learning, and that spatial configurations carry more discriminative information than temporal dynamics. The spatial transformer achieves 95.74% accuracy, while the multi-scale temporal transformer achieves 93.90%. Feature-level fusion pushes performance to 98.03%, confirming that postural and dynamic information are complementary. These findings highlight the potential of transformer architectures for person identification in natural interactions and provide insights for future multimodal and cross-cultural studies.
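The "hierarchical motion modeling" of the multi-scale temporal transformer can be gestured at with a simple sketch: summarize a feature sequence at several temporal window sizes and combine the summaries. This is an illustrative stand-in only (the paper's transformer attention layers are omitted; window sizes and feature dimensions are assumptions):

```python
import numpy as np

def multiscale_pool(features, scales=(1, 2, 4)):
    """Average-pool a (T, D) feature sequence at several temporal
    window sizes and concatenate one summary vector per scale.
    Returns an array of shape (len(scales) * D,).
    """
    T, D = features.shape
    outputs = []
    for s in scales:
        n = T // s
        # Group frames into non-overlapping windows of length s, then
        # average within each window and across windows.
        pooled = features[: n * s].reshape(n, s, D).mean(axis=1)
        outputs.append(pooled.mean(axis=0))
    return np.concatenate(outputs)

feats = np.ones((8, 16))       # 8 frames of 16-dim features (toy values)
out = multiscale_pool(feats)
print(out.shape)  # (48,)
```

In the paper's actual architecture, attention at each scale would replace the plain averaging; the sketch only shows how multiple temporal granularities yield one joint representation.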
Problem

Research questions and friction points this paper is trying to address.

Identifying persons from conversational dynamics and body keypoints
Comparing the effectiveness of spatial configurations versus temporal motion patterns
Developing transformer architectures for person identification in natural interactions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stream transformer framework separately models spatial keypoint configurations and temporal motion
Multi-scale temporal transformer enables hierarchical motion modeling
Feature-level fusion combines postural and dynamic information effectively
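The feature-level fusion listed above amounts to combining the two streams' embeddings before classification. A minimal concatenation-based sketch, assuming hypothetical 256-dim per-stream embeddings and 10 identities (none of these numbers are from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in embeddings for one clip from the two streams (assumed dims).
spatial_emb = rng.standard_normal(256)   # spatial transformer output
temporal_emb = rng.standard_normal(256)  # multi-scale temporal transformer output

# Feature-level fusion: concatenate, then one linear identity classifier.
fused = np.concatenate([spatial_emb, temporal_emb])   # shape (512,)
W = rng.standard_normal((10, fused.shape[0])) * 0.01  # 10 hypothetical identities
logits = W @ fused                                    # shape (10,)
pred = int(np.argmax(logits))
print(fused.shape, pred)
```

Concatenation lets the classifier weight postural and dynamic evidence jointly, which is consistent with the reported gain from 95.74%/93.90% per stream to 98.03% fused.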