🤖 AI Summary
This work addresses the assessment of motor and cognitive function through complex human behavior in kitchen environments. We introduce EPFL-Smart-Kitchen-30, a densely annotated multimodal cooking dataset comprising 29.7 hours of recordings from 16 participants preparing four recipes, with synchronized nine-view RGB-D video, inertial measurement unit (IMU) signals, HoloLens 2 eye tracking, and 3D hand and body pose. Actions are annotated at a fine-grained density of 33.78 segments per minute (22,415 total). On top of this dataset, we establish four benchmarks: vision-language understanding, semantic text-to-motion generation, multimodal action recognition, and pose-based action segmentation. The dataset and code are publicly released, and we expect them to enable better methods for, and insights into, ecologically valid human behavior.
📝 Abstract
Understanding behavior requires datasets that capture humans carrying out complex tasks. The kitchen is an excellent environment for assessing human motor and cognitive function, as many complex actions are naturally exhibited there, from chopping to cleaning. Here, we introduce the EPFL-Smart-Kitchen-30 dataset, collected using a noninvasive motion capture platform inside a kitchen environment. Nine static RGB-D cameras, inertial measurement units (IMUs), and one head-mounted HoloLens 2 headset were used to capture 3D hand, body, and eye movements. EPFL-Smart-Kitchen-30 is a multi-view action dataset with synchronized exocentric and egocentric video, depth, IMU, eye gaze, and body and hand kinematics, spanning 29.7 hours of 16 subjects cooking four different recipes. Action sequences were densely annotated at 33.78 action segments per minute. Leveraging this multimodal dataset, we propose four benchmarks to advance behavior understanding and modeling: 1) a vision-language benchmark, 2) a semantic text-to-motion generation benchmark, 3) a multimodal action recognition benchmark, and 4) a pose-based action segmentation benchmark. We expect the EPFL-Smart-Kitchen-30 dataset to pave the way for better methods as well as insights into the nature of ecologically valid human behavior. Code and data are available at https://github.com/amathislab/EPFL-Smart-Kitchen
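
To make the structure of such a synchronized, densely annotated multimodal sample concrete, here is a minimal Python sketch. It is purely illustrative: the class names, field names, and array shapes are assumptions and do not reflect the released data format or any official loading API.

```python
# Hypothetical sketch of one synchronized multimodal sample and its dense
# action annotations. All names and shapes are illustrative assumptions,
# not the released EPFL-Smart-Kitchen-30 format.
from dataclasses import dataclass, field
from typing import List

import numpy as np


@dataclass
class ActionSegment:
    """A densely annotated action segment with start/end times and a label."""
    start_s: float
    end_s: float
    label: str  # e.g. "chop carrot"; label vocabulary is an assumption


@dataclass
class KitchenSample:
    """One synchronized time step across the modalities listed in the abstract."""
    rgb: np.ndarray           # (9, H, W, 3)  nine exocentric RGB views
    depth: np.ndarray         # (9, H, W)     aligned depth maps
    ego_rgb: np.ndarray       # (H, W, 3)     HoloLens 2 egocentric frame
    gaze: np.ndarray          # (2,)          normalized eye-gaze point
    imu: np.ndarray           # (num_imus, 6) accelerometer + gyroscope
    body_pose_3d: np.ndarray  # (num_joints, 3)
    hand_pose_3d: np.ndarray  # (2, num_hand_joints, 3)
    segments: List[ActionSegment] = field(default_factory=list)


def segments_per_minute(segments: List[ActionSegment], duration_s: float) -> float:
    """Annotation density, the statistic quoted as 33.78 segments per minute."""
    return len(segments) / (duration_s / 60.0)
```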