TRACE: Real-Time Multimodal Common Ground Tracking in Situated Collaborative Dialogues

📅 2025-03-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the challenge of real-time tracking of shared grounding in embodied collaborative dialogue, this paper proposes the first millisecond-level online shared grounding modeling framework. Methodologically, it integrates speech, gesture, motion, and eye-tracking signals through a lightweight multimodal fusion architecture, a real-time gaze-speech alignment model, and a propositional-logic-based distributed belief tracker, enabling dynamic inference and updating of group cognitive states. Its key contribution lies in being the first to jointly model multimodal behaviors and group belief evolution with end-to-end latency under 80 ms, thereby supporting truly online shared grounding construction. Evaluated on real-world collaborative tasks, the framework achieves 92.4% accuracy in proposition-level consensus identification—significantly outperforming existing offline approaches.

Technology Category

Application Category

📝 Abstract
We present TRACE, a novel system for live *common ground* tracking in situated collaborative tasks. With a focus on fast, real-time performance, TRACE tracks the speech, actions, gestures, and visual attention of participants, uses these multimodal inputs to determine the set of task-relevant propositions that have been raised as the dialogue progresses, and tracks the group's epistemic position and beliefs toward them as the task unfolds. Amid increased interest in AI systems that can mediate collaborations, TRACE represents an important step forward for agents that can engage with multiparty, multimodal discourse.
Problem

Research questions and friction points this paper is trying to address.

Real-time tracking of common ground in collaborative tasks.
Multimodal input integration for task-relevant proposition tracking.
Monitoring group epistemic positions and beliefs during dialogues.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Real-time multimodal common ground tracking
Integrates speech, actions, gestures, visual attention
Tracks epistemic positions and beliefs dynamically
🔎 Similar Papers
No similar papers found.