🤖 AI Summary
Embodied agents struggle to achieve human-level performance due to the scarcity of large-scale, real-time, multimodal, socially interactive datasets grounded in naturalistic environments. To address this, we introduce the first millisecond-aligned, five-modality dataset of multiplayer interaction in Minecraft, comprising synchronized video, in-game audio, microphone speech, mouse movements, and keyboard inputs, and spanning over 10,000 hours of authentic gameplay. We propose a novel acquisition framework that synchronizes these modalities at millisecond precision, enabling the first large-scale recording of socially situated embodied behavior in open-world settings. We design a unified benchmark suite evaluating object recognition, spatial reasoning, language grounding, and long-horizon memory. Our infrastructure includes a high-precision logging system, custom capture plugins, privacy-preserving anonymization mechanisms, and a modular evaluation toolkit. As an initial release, we publicly share a curated 200-hour subset, establishing foundational data infrastructure for real-time, goal-directed embodied AI research.
📝 Abstract
Advances in deep generative modelling have made it increasingly plausible to train human-level embodied agents. Yet progress has been limited by the absence of large-scale, real-time, multi-modal, and socially interactive datasets that reflect the sensory-motor complexity of natural environments. To address this, we present PLAICraft, a novel data collection platform and dataset capturing multiplayer Minecraft interactions across five time-aligned modalities: video, game output audio, microphone input audio, mouse, and keyboard actions. Each modality is logged with millisecond time precision, enabling the study of synchronous, embodied behaviour in a rich, open-ended world. The dataset comprises over 10,000 hours of gameplay from more than 10,000 global participants. (We have done a privacy review for the public release of an initial 200-hour subset of the dataset, with plans to release most of the dataset over time.) Alongside the dataset, we provide an evaluation suite for benchmarking model capabilities in object recognition, spatial awareness, language grounding, and long-term memory. PLAICraft opens a path toward training and evaluating agents that act fluently and purposefully in real time, paving the way for truly embodied artificial intelligence.
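To illustrate what millisecond-level time alignment across modalities makes possible, the following is a minimal sketch that assigns timestamped input events to the video frame on screen when they occurred. It does not reflect PLAICraft's actual file format or tooling: the function names, the `(timestamp_ms, payload)` event layout, and the sample data are all hypothetical assumptions made for illustration.

```python
from bisect import bisect_right

def frame_index_for(event_ms, frame_starts_ms):
    """Return the index of the frame on screen at event_ms.

    frame_starts_ms is assumed to be a sorted list of frame start times in
    milliseconds. Returns None if the event precedes the first frame.
    """
    i = bisect_right(frame_starts_ms, event_ms) - 1
    return i if i >= 0 else None

def group_events_by_frame(events, frame_starts_ms):
    """Group (timestamp_ms, payload) events by the frame they fall into."""
    grouped = {}
    for t_ms, payload in events:
        idx = frame_index_for(t_ms, frame_starts_ms)
        if idx is not None:
            grouped.setdefault(idx, []).append(payload)
    return grouped

# Hypothetical sample: four frames at roughly 30 fps and four keyboard events.
frame_starts_ms = [0, 33, 67, 100]
key_events = [(5, "W down"), (40, "SPACE down"), (66, "SPACE up"), (100, "W up")]

aligned = group_events_by_frame(key_events, frame_starts_ms)
print(aligned)
```

The same lookup could be applied to mouse events or audio chunk boundaries, provided each stream carries timestamps on a shared clock; whether the released data exposes timestamps in this form is an assumption here.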