PLAICraft: Large-Scale Time-Aligned Vision-Speech-Action Dataset for Embodied AI

📅 2025-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Embodied agents struggle to achieve human-level performance due to the scarcity of large-scale, real-time, multimodal, socially interactive datasets grounded in naturalistic environments. To address this, we introduce the first millisecond-aligned, five-modal dataset for multiplayer interaction in Minecraft—comprising synchronized video, in-game audio, microphone speech, mouse movements, and keyboard inputs—spanning over 10,000 hours of authentic gameplay. We propose a novel multimodal millisecond-synchronization acquisition framework, enabling the first large-scale recording of socially situated embodied behavior in open-world settings. We design a unified benchmark suite evaluating object recognition, spatial reasoning, language grounding, and long-horizon memory. Our infrastructure includes a high-precision logging system, custom capture plugins, privacy-preserving anonymization mechanisms, and a modular evaluation toolkit. As an initial release, we publicly share a curated 200-hour subset, establishing foundational data infrastructure for real-time, goal-directed embodied AI research.

📝 Abstract

Advances in deep generative modelling have made it increasingly plausible to train human-level embodied agents. Yet progress has been limited by the absence of large-scale, real-time, multi-modal, and socially interactive datasets that reflect the sensory-motor complexity of natural environments. To address this, we present PLAICraft, a novel data collection platform and dataset capturing multiplayer Minecraft interactions across five time-aligned modalities: video, game output audio, microphone input audio, mouse, and keyboard actions. Each modality is logged with millisecond time precision, enabling the study of synchronous, embodied behaviour in a rich, open-ended world. The dataset comprises over 10,000 hours of gameplay from more than 10,000 global participants. (Footnote: We have done a privacy review for the public release of an initial 200-hour subset of the dataset, with plans to release most of the dataset over time.) Alongside the dataset, we provide an evaluation suite for benchmarking model capabilities in object recognition, spatial awareness, language grounding, and long-term memory. PLAICraft opens a path toward training and evaluating agents that act fluently and purposefully in real time, paving the way for truly embodied artificial intelligence.
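
To make the idea of time-aligned modalities concrete, the following is a minimal sketch of how millisecond-stamped input events might be grouped under video frames. It is not PLAICraft's actual format or tooling: the `Event` record, its field names, the payload strings, and the 10 fps frame spacing are all hypothetical and chosen only for illustration.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical input event; field names are assumptions, not the dataset schema."""
    t_ms: int      # timestamp in milliseconds since session start (assumed convention)
    modality: str  # e.g. "keyboard" or "mouse"
    payload: str   # illustrative payload such as "W_down"


def group_events_by_frame(frame_times_ms, events):
    """Assign each event to the latest frame whose timestamp is <= the event time.

    frame_times_ms must be sorted in ascending order. Events that occur
    before the first frame are dropped.
    """
    buckets = [[] for _ in frame_times_ms]
    for ev in sorted(events, key=lambda e: e.t_ms):
        idx = bisect_right(frame_times_ms, ev.t_ms) - 1
        if idx >= 0:
            buckets[idx].append(ev)
    return buckets


# Illustrative data: four frames at 10 fps and a handful of input events.
frame_times_ms = [0, 100, 200, 300]
events = [
    Event(5, "keyboard", "W_down"),
    Event(99, "mouse", "dx=3,dy=-1"),
    Event(100, "keyboard", "W_up"),
    Event(250, "mouse", "left_click"),
]

aligned = group_events_by_frame(frame_times_ms, events)
for t, bucket in zip(frame_times_ms, aligned):
    print(t, [(e.modality, e.payload) for e in bucket])
```

Because every stream shares one millisecond clock in this sketch, alignment reduces to a binary search over frame timestamps; audio streams could be indexed the same way by converting sample offsets to milliseconds.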

Problem

Research questions and friction points this paper is trying to address.

Lack of large-scale multi-modal datasets for embodied AI

Need for time-aligned sensory-motor data in natural environments

Absence of benchmarks for real-time embodied agent evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal dataset with five time-aligned modalities

Millisecond precision for synchronous behavior study

Evaluation suite for benchmarking model capabilities

🔎 Similar Papers

Chrono: A Simple Blueprint for Representing Time in MLLMs

2024-06-26

Citations: 4