Detecting In-Person Conversations in Noisy Real-World Environments with Smartwatch Audio and Motion Sensing

📅 2025-07-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the low accuracy of smartwatch-based face-to-face conversation detection in noisy real-world environments. We propose an audio–motion bimodal fusion framework tailored to everyday scenarios: a smartwatch synchronously captures microphone audio and IMU motion signals, three cross-modal fusion strategies are compared, and deep learning models are integrated with traditional machine learning models to jointly capture verbal content and nonverbal cues (e.g., nodding, gesturing). To our knowledge, this is the first end-to-end approach enabling face-to-face conversation recognition under realistic acoustic noise. It achieves macro-F1 scores of 82.0±3.0% in controlled laboratory settings and 77.2±1.8% in semi-naturalistic environments, demonstrating the efficacy and resilience of multimodal wearable sensing for understanding social interactions in complex acoustic conditions.
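The paper compares three fusion strategies but this summary does not spell them out, so the following is a hedged sketch of one common option, feature-level (early) fusion: per-modality features are extracted from synchronized audio and IMU windows and concatenated into a single vector for a downstream classifier. The band-energy audio features and per-axis IMU statistics below are illustrative stand-ins, not the paper's actual feature set.

```python
import numpy as np

def audio_features(audio, n_bands=8):
    """Log band energies from an FFT magnitude spectrum.
    Illustrative stand-in for richer audio features such as MFCCs."""
    spec = np.abs(np.fft.rfft(audio))
    bands = np.array_split(spec, n_bands)
    return np.log1p(np.array([b.sum() for b in bands]))

def imu_features(accel):
    """Per-axis mean and standard deviation of a 3-axis accelerometer
    window, a crude proxy for gesture/nod motion energy."""
    return np.concatenate([accel.mean(axis=0), accel.std(axis=0)])

def early_fusion(audio, accel):
    """Feature-level fusion: concatenate modality features into one
    vector to feed a single classifier."""
    return np.concatenate([audio_features(audio), imu_features(accel)])

# Hypothetical 1 s window: 16 kHz audio plus 50 Hz accelerometer samples.
rng = np.random.default_rng(0)
vec = early_fusion(rng.standard_normal(16000), rng.standard_normal((50, 3)))
print(vec.shape)  # 8 audio bands + 6 IMU statistics -> (14,)
```

The fused vector would then be passed to any of the ML/DL classifiers the paper evaluates; the sampling rates and window length here are assumptions for illustration only.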

📝 Abstract
Social interactions play a crucial role in shaping human behavior, relationships, and societies. They encompass various forms of communication, such as verbal conversation, non-verbal gestures, facial expressions, and body language. In this work, we develop a novel computational approach to detect a foundational aspect of human social interactions, in-person verbal conversations, by leveraging audio and inertial data captured with a commodity smartwatch in acoustically challenging scenarios. To evaluate our approach, we conducted a lab study with 11 participants and a semi-naturalistic study with 24 participants. We analyzed machine learning and deep learning models with 3 different fusion methods, showing the advantages of fusing audio and inertial data to consider not only verbal cues but also non-verbal gestures in conversations. Furthermore, we performed a comprehensive set of evaluations across activities and sampling rates to demonstrate the benefits of multimodal sensing in specific contexts. Overall, our framework achieved an 82.0±3.0% macro F1-score when detecting conversations in the lab and 77.2±1.8% in the semi-naturalistic setting.
Problem

Research questions and friction points this paper is trying to address.

Detect in-person conversations using smartwatch sensors
Fuse audio and motion data for accurate conversation detection
Evaluate performance in noisy real-world environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging smartwatch audio and motion data
Fusing audio and inertial data with ML/DL
Detecting conversations in noisy environments
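As a counterpart to feature-level fusion, a minimal sketch of decision-level (late) fusion: separate audio and IMU classifiers each emit per-window conversation probabilities, which are combined by a weighted average. The weight and the probability values below are hypothetical, not taken from the paper.

```python
import numpy as np

def late_fusion(p_audio, p_imu, w=0.5):
    """Decision-level fusion: weighted average of per-modality
    conversation probabilities (w weights the audio model)."""
    return w * np.asarray(p_audio) + (1 - w) * np.asarray(p_imu)

# Hypothetical window-level probabilities from two independent classifiers.
p = late_fusion([0.9, 0.2, 0.6], [0.7, 0.4, 0.8])
labels = (p >= 0.5).astype(int)
print(labels.tolist())  # [1, 0, 1]
```

A design note: late fusion keeps each modality's model independent, which is convenient when the two streams run at very different sampling rates, while early fusion lets a single model learn cross-modal correlations (such as speech co-occurring with gesturing).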