AI Summary
Existing egocentric video datasets are predominantly single-view, which limits research on multi-view and multimodal understanding. To address this, CASTLE 2024 provides a large-scale, time-aligned, multi-view collection: ultra-high-definition (UHD) video at 50 frames per second captured synchronously from ten egocentric (first-person) viewpoints and five fixed exocentric (third-person) viewpoints, together with audio, additional sensor streams, and auxiliary data. The 15 time-aligned video and audio sources amount to over 600 hours of openly accessible recordings, and, unlike many comparable datasets, no faces are blurred and no audio is distorted. By easing the data bottleneck in multi-view collaborative modeling, the dataset serves as a resource for fine-grained action understanding, social interaction modeling, and embodied AI research.
Abstract
Egocentric video has attracted growing interest in recent years, with applications across a wide range of areas. However, most existing datasets are limited to a single perspective. In this paper, we present the CASTLE 2024 dataset, a multimodal collection containing ego- and exo-centric (i.e., first- and third-person perspective) video and audio from 15 time-aligned sources, as well as other sensor streams and auxiliary data. The dataset was recorded by volunteer participants over four days in a fixed location and includes the point of view of 10 participants, with an additional 5 fixed cameras providing an exocentric perspective. The entire dataset contains over 600 hours of UHD video recorded at 50 frames per second. In contrast to other datasets, CASTLE 2024 does not contain any partial censoring, such as blurred faces or distorted audio. The dataset is available via https://castle-dataset.github.io/.
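Because all 15 video and audio sources are time-aligned and recorded at a constant 50 fps, temporally corresponding frames across views can be looked up with simple arithmetic on a shared timeline. The following is a minimal sketch of that idea; the stream names, per-stream start offsets, and metadata layout shown here are hypothetical and do not reflect the dataset's actual file structure.

```python
from dataclasses import dataclass

FPS = 50  # CASTLE 2024 video is recorded at 50 frames per second


@dataclass
class Stream:
    name: str            # e.g. "ego_participant_03" or "exo_cam_1" (hypothetical names)
    start_time_s: float  # stream start on the shared timeline, in seconds (assumed metadata)


def frame_index_at(stream: Stream, global_time_s: float) -> int:
    """Map a point on the shared timeline to the nearest frame index of one stream."""
    local_time = global_time_s - stream.start_time_s
    if local_time < 0:
        raise ValueError(f"{stream.name} has not started at t={global_time_s:.3f}s")
    return round(local_time * FPS)


# Example: retrieve the temporally corresponding frame from every view at t = 125.0 s.
streams = [Stream(f"ego_participant_{i:02d}", 0.0) for i in range(1, 11)] + \
          [Stream(f"exo_cam_{i}", 0.0) for i in range(1, 6)]
aligned = {s.name: frame_index_at(s, 125.0) for s in streams}
```

In practice, the per-stream start offsets would come from whatever synchronization metadata accompanies the recordings; the sketch only illustrates how a shared timeline and a fixed frame rate make cross-view frame lookup a constant-time operation.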