🤖 AI Summary
To address the challenge of real-time tracking of shared grounding in embodied collaborative dialogue, this paper proposes the first millisecond-level online shared grounding modeling framework. Methodologically, it integrates speech, gesture, motion, and eye-tracking signals through a lightweight multimodal fusion architecture, a real-time gaze-speech alignment model, and a propositional-logic-based distributed belief tracker, enabling dynamic inference and updating of group cognitive states. Its key contribution lies in being the first to jointly model multimodal behaviors and group belief evolution with end-to-end latency under 80 ms, thereby supporting truly online shared grounding construction. Evaluated on real-world collaborative tasks, the framework achieves 92.4% accuracy in proposition-level consensus identification—significantly outperforming existing offline approaches.
📝 Abstract
We present TRACE, a novel system for live *common ground* tracking in situated collaborative tasks. With a focus on fast, real-time performance, TRACE tracks the speech, actions, gestures, and visual attention of participants, uses these multimodal inputs to determine the set of task-relevant propositions that have been raised as the dialogue progresses, and tracks the group's epistemic position and beliefs toward them as the task unfolds. Amid increased interest in AI systems that can mediate collaborations, TRACE represents an important step forward for agents that can engage with multiparty, multimodal discourse.