🤖 AI Summary
Existing mixed reality (MR) collaboration research relies on external devices or unimodal data, limiting real-time, deployable interaction awareness. This work proposes the first real-time, multimodal group interaction sensing system that uses only onboard MR headset sensors (speech, eye gaze, and spatial pose), eliminating the need for external cameras or offline annotation. By integrating temporal modeling with dynamic network analysis, the system automatically infers both static structural properties and transient behavioral patterns during collaboration, revealing how behavioral dynamics couple with the evolution of the interaction network. Evaluated with 48 participants organized into 12 four-person teams, the system accurately captures fine-grained collaborative transitions, demonstrating its effectiveness and practicality. This work establishes a scalable, lightweight sensing foundation for real-time collaborative support in MR environments.
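To make the pipeline concrete, the sketch below shows one plausible way to turn per-participant sensor events into windowed interaction networks of the kind the summary describes. The event schema, modality weights, and window sizes are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of windowed interaction-network construction from
# multimodal headset events. Event format, weights, and window parameters
# are assumptions for illustration, not the system's real design.
from dataclasses import dataclass
import numpy as np

@dataclass
class Event:
    t: float    # timestamp in seconds
    src: int    # participant index emitting the signal
    dst: int    # participant index the signal is directed at
    kind: str   # "speech", "gaze", or "proximity"

# Hypothetical modality weights; real values would be tuned or learned.
WEIGHTS = {"speech": 1.0, "gaze": 0.5, "proximity": 0.3}

def window_networks(events, n_people, t_end, win=10.0, hop=5.0):
    """Slide a window over the event stream and build one weighted,
    symmetric adjacency matrix per window (the dynamic networks)."""
    t0, nets = 0.0, []
    while t0 < t_end:
        A = np.zeros((n_people, n_people))
        for e in events:
            if t0 <= e.t < t0 + win:
                A[e.src, e.dst] += WEIGHTS[e.kind]
                A[e.dst, e.src] += WEIGHTS[e.kind]  # treat ties as mutual
        nets.append((t0, A))
        t0 += hop
    return nets

def static_network(nets):
    """Aggregate the windowed networks into one overall static network."""
    return sum(A for _, A in nets) / max(len(nets), 1)
```

Averaging the windowed matrices yields the static structural view, while the per-window sequence carries the transient behavioral patterns.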
📝 Abstract
Understanding how teams coordinate, share work, and negotiate roles in immersive environments is critical for designing effective mixed-reality (MR) applications that support real-time collaboration. However, existing methods either rely on external cameras and offline annotation or focus narrowly on single modalities, limiting their validity and applicability. To address this, we present a novel group interaction sensing toolkit (GIST), a deployable system that passively captures multimodal interaction data, such as speech, gaze, and spatial proximity, from commodity MR headset sensors and automatically derives both overall static interaction networks and dynamic, moment-by-moment behavior patterns. We evaluate GIST in a human-subject study with 48 participants across 12 four-person groups performing an open-ended image-sorting task in MR. Our analysis shows strong alignment between the identified behavior modes and shifts in interaction network structure, confirming that momentary changes in group behavior are observable through the headset's speech, gaze, and proximity data.
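The reported alignment between behavior modes and network shifts suggests a simple analysis pattern: compare consecutive windowed networks and flag large structural jumps, then check whether those jumps coincide with behavior-mode boundaries. The distance metric and threshold below are illustrative assumptions, not GIST's actual method.

```python
# A sketch of detecting shifts in interaction-network structure by
# measuring the normalized Frobenius distance between consecutive
# windowed adjacency matrices. The metric and threshold are assumptions.
import numpy as np

def structural_shifts(nets, thresh=0.5):
    """Return window start times where the network changes sharply.
    `nets` is a list of (t_start, adjacency_matrix) pairs, e.g. the
    output of the window_networks sketch above."""
    shifts = []
    for (_, A_prev), (t_cur, A_cur) in zip(nets, nets[1:]):
        denom = np.linalg.norm(A_prev) + np.linalg.norm(A_cur) + 1e-9
        d = np.linalg.norm(A_cur - A_prev) / denom
        if d > thresh:
            shifts.append(t_cur)
    return shifts
```

The detected shift times could then be compared against independently labeled behavior-mode boundaries to quantify the alignment the abstract reports.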