🤖 AI Summary
Research on human behavior modeling is hindered by the lack of large-scale, multimodal, high-fidelity 3D motion datasets, limiting progress on scenarios ranging from single-person actions, gestures, and locomotion to multi-person dialogue and collaboration. To address this, Embody 3D provides high-precision 3D full-body and hand poses, synchronized per-participant audio, fine-grained textual annotations, and diverse social interaction scenarios: 439 participants, 500 hours of multi-view motion capture, and over 54 million high-quality 3D motion frames. Multi-camera motion capture, speaker-separated per-participant audio recording, and collaborative behavior annotation yield strict temporal alignment across modalities and rich semantic labeling. The dataset establishes new benchmarks for complex behavior understanding and generation, supporting research in virtual avatars, natural human–computer interaction, and computational social behavior analysis.
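For scale context, 54 million frames over 500 hours implies a capture rate of roughly 30 fps (54,000,000 / (500 × 3600 s) ≈ 30). Below is a minimal sketch of cross-modal index alignment under that inferred rate, with a hypothetical 48 kHz audio sample rate; neither rate is stated in the source, so treat both constants as assumptions.

```python
# Hypothetical alignment sketch. The 30 fps figure is inferred from
# 54M frames over 500 hours; 48 kHz is an assumed audio sample rate.

MOTION_FPS = 54_000_000 / (500 * 3600)  # ≈ 30 frames per second (inferred)
AUDIO_SR = 48_000                       # assumed audio sample rate (Hz)

def motion_frame_to_audio_samples(frame_idx: int) -> tuple[int, int]:
    """Map a motion-capture frame index to the [start, end) range of
    audio samples covering the same wall-clock interval."""
    t_start = frame_idx / MOTION_FPS
    t_end = (frame_idx + 1) / MOTION_FPS
    return int(round(t_start * AUDIO_SR)), int(round(t_end * AUDIO_SR))

# Example: motion frame 90 (~3 s into a take) maps to audio samples
# [144000, 145600) at the assumed rates.
print(motion_frame_to_audio_samples(90))
```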
📝 Abstract
The Codec Avatars Lab at Meta introduces Embody 3D, a multimodal dataset of 500 individual hours of 3D motion data from 439 participants collected in a multi-camera collection stage, amounting to over 54 million frames of tracked 3D motion. The dataset features a wide range of single-person motion data, including prompted motions, hand gestures, and locomotion; as well as multi-person behavioral and conversational data like discussions, conversations in different emotional states, collaborative activities, and co-living scenarios in an apartment-like space. We provide tracked human motion including hand tracking and body shape, text annotations, and a separate audio track for each participant.
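The modalities listed above (body and hand tracking, body shape, text annotations, and a separate audio track per participant) suggest a simple per-participant record layout. The following is a hypothetical sketch; field names and array shapes are illustrative and not the dataset's actual schema.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class ParticipantTake:
    """Hypothetical container for one participant in one capture take.
    All names and shapes are illustrative; the released schema may differ."""
    participant_id: str
    body_pose: np.ndarray   # (T, J_body, 3) per-frame tracked 3D body joints
    hand_pose: np.ndarray   # (T, J_hand, 3) per-frame tracked hand joints
    body_shape: np.ndarray  # (S,) per-participant body shape parameters
    audio_path: str         # path to this participant's separate audio track
    text_annotations: list[str] = field(default_factory=list)  # behavior labels
```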