🤖 AI Summary
Addressing the challenge of modeling and generating human daily activities in realistic home environments, this paper introduces ParaHome. The system combines 70 synchronized RGB cameras with wearable motion capture devices in a natural home setting. These devices are an IMU-based body suit and hand motion capture gloves. Together they record full-body motion, dexterous hand manipulation, and the 3D movement of multiple objects in synchrony. The captured interactions include both sequential and concurrent manipulations, and each is paired with a text description. Articulated objects with multiple parts are represented by parameterized 3D models. Using this system, the authors collect a human-object interaction dataset of 486 minutes across 207 captures with 38 participants. They also present design justifications for the system and run generative modeling experiments that demonstrate the dataset's potential for research on understanding and generating everyday human activities.
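The text does not say how the wearable IMU and glove streams are aligned with the camera frames. As a hedged illustration only, assume every stream carries timestamps on a shared clock. Under that assumption, nearest-timestamp matching is one common way to pair each camera frame with a wearable sample. The function name `nearest_indices` and the millisecond timestamps below are hypothetical and are not taken from ParaHome.

```python
from bisect import bisect_left
from typing import List, Sequence


def nearest_indices(ref_times: Sequence[float], query_times: Sequence[float]) -> List[int]:
    """For each query timestamp, return the index of the closest reference timestamp.

    Hypothetical helper: assumes both streams share one clock and that
    ref_times is sorted in ascending order.
    """
    if not ref_times:
        raise ValueError("ref_times must be non-empty")
    out = []
    for t in query_times:
        i = bisect_left(ref_times, t)  # first index with ref_times[i] >= t
        if i == 0:
            out.append(0)  # query precedes (or equals) the first sample
        elif i == len(ref_times):
            out.append(len(ref_times) - 1)  # query follows the last sample
        else:
            # Choose the closer neighbor; ties go to the earlier sample.
            before, after = ref_times[i - 1], ref_times[i]
            out.append(i - 1 if t - before <= after - t else i)
    return out


# Assumed example: wearable samples every 10 ms, camera frames at roughly 30 fps.
imu_ms = [0, 10, 20, 30, 40, 50, 60, 70]
camera_ms = [0, 33, 67]
print(nearest_indices(imu_ms, camera_ms))  # [0, 3, 7]
```

Binary search keeps each lookup logarithmic in the length of the reference stream. Interpolating between the two neighboring samples would be a natural refinement when the sensor rates differ widely.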
📝 Abstract
To enable machines to understand the way humans interact with the physical world in daily life, 3D interaction signals should be captured in natural settings, allowing people to engage with multiple objects in a range of sequential and casual manipulations. To achieve this goal, we introduce ParaHome, a system designed to capture dynamic 3D movements of humans and objects within a common home environment. Our system features a multi-view setup with 70 synchronized RGB cameras, along with wearable motion capture devices including an IMU-based body suit and hand motion capture gloves. Using the ParaHome system, we collect a new human-object interaction dataset of 486 minutes of sequences across 207 captures with 38 participants. The dataset offers advances in three key aspects: (1) it captures body motion and dexterous hand manipulation alongside multiple objects within a contextual home environment; (2) it encompasses sequential and concurrent manipulations paired with text descriptions; and (3) it includes articulated objects with multiple parts represented by 3D parameterized models. We present detailed design justifications for our system and perform key generative modeling experiments to demonstrate the potential of our dataset.
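The abstract states that articulated objects with multiple parts are represented by 3D parameterized models, but it does not give the parameterization. A minimal sketch follows, assuming each movable part is attached by a single revolute or prismatic joint driven by one scalar parameter. This is a common convention and not necessarily ParaHome's. The class `Part`, its fields, and the joint-type labels are hypothetical. A revolute part rotates about an axis through a pivot point using Rodrigues' rotation formula, and a prismatic part translates along its axis.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def _normalize(v: Vec3) -> Vec3:
    n = math.sqrt(sum(c * c for c in v))
    if n == 0:
        raise ValueError("axis must be non-zero")
    return (v[0] / n, v[1] / n, v[2] / n)


def _rotate(p: Vec3, k: Vec3, angle: float) -> Vec3:
    """Rotate p about unit axis k through the origin (Rodrigues' formula)."""
    c, s = math.cos(angle), math.sin(angle)
    dot = k[0] * p[0] + k[1] * p[1] + k[2] * p[2]
    cross = (k[1] * p[2] - k[2] * p[1],
             k[2] * p[0] - k[0] * p[2],
             k[0] * p[1] - k[1] * p[0])
    # v' = v cos(a) + (k x v) sin(a) + k (k . v)(1 - cos(a))
    return tuple(p_i * c + x_i * s + k_i * dot * (1 - c)
                 for p_i, x_i, k_i in zip(p, cross, k))


@dataclass
class Part:
    """Hypothetical one-joint part of an articulated object, in the object frame."""
    joint: str   # "revolute" or "prismatic" (illustrative labels)
    axis: Vec3   # joint axis direction (normalized internally)
    pivot: Vec3  # a point on the joint axis (used by revolute joints)

    def transform(self, points: List[Vec3], q: float) -> List[Vec3]:
        axis = _normalize(self.axis)
        if self.joint == "revolute":
            # q is an angle in radians: rotate about the axis passing through pivot.
            out = []
            for p in points:
                local = tuple(a - b for a, b in zip(p, self.pivot))
                r = _rotate(local, axis, q)
                out.append(tuple(a + b for a, b in zip(r, self.pivot)))
            return out
        if self.joint == "prismatic":
            # q is a displacement along the axis, in the object's length units.
            return [tuple(a + q * k for a, k in zip(p, axis)) for p in points]
        raise ValueError(f"unknown joint type: {self.joint}")


# Assumed example: a door hinged on the z-axis and a drawer sliding along x.
door = Part("revolute", (0.0, 0.0, 1.0), (0.0, 0.0, 0.0))
drawer = Part("prismatic", (1.0, 0.0, 0.0), (0.0, 0.0, 0.0))
print(door.transform([(1.0, 0.0, 0.0)], math.pi / 2))  # approximately [(0, 1, 0)]
print(drawer.transform([(0.0, 0.0, 0.0)], 0.3))        # [(0.3, 0.0, 0.0)]
```

Under this assumption, an object's full state is its base rigid pose plus one scalar per part. That compact state is one reason parameterized models suit generative modeling of object motion.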