A Large-Scale Multimodal Dataset and Benchmarks for Human Activity Scene Understanding and Reasoning

📅 2025-12-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large vision-language models (LVLMs) offer little support for human action understanding (HAU) and human action reasoning (HARn), largely because large-scale non-RGB multimodal data (e.g., depth, IMU, millimeter-wave) and fine-grained action annotations are scarce. To address this gap, we introduce CUHK-X, the first large-scale multimodal dataset for HAU/HARn, comprising 58,445 samples spanning RGB, depth, IMU, and millimeter-wave modalities. We propose a prompt-driven scene generation framework, augmented with human verification, to produce logically coherent and temporally consistent action description sequences. We further establish a benchmark suite covering classification, descriptive generation, and causal reasoning. State-of-the-art models achieve average accuracies of 76.52%, 40.76%, and 70.25% on the recognition, understanding, and reasoning tasks, respectively, underscoring how CUHK-X narrows the data and evaluation gaps in fine-grained non-RGB action analysis.

📝 Abstract
Multimodal human action recognition (HAR) leverages complementary sensors for activity classification. Beyond recognition, recent advances in large language models (LLMs) enable detailed descriptions and causal reasoning, motivating new tasks: human action understanding (HAU) and human action reasoning (HARn). However, most LLMs, especially large vision-language models (LVLMs), struggle with non-RGB modalities such as depth, IMU, and mmWave due to the lack of large-scale data-caption resources. Existing HAR datasets mainly provide coarse data-label annotations, which are insufficient to capture the fine-grained action dynamics needed for HAU and HARn. We consider two ground-truth pair types: (1) data label (discrete category) and (2) data caption (textual description). Captions naively generated from labels often lack logical and spatiotemporal consistency. We introduce CUHK-X, a large-scale multimodal dataset and benchmark suite for HAR, HAU, and HARn. CUHK-X contains 58,445 samples covering 40 actions performed by 30 participants across two indoor environments. To improve caption consistency, we propose a prompt-based scene creation method that leverages LLMs to generate logically connected activity sequences, followed by human validation. CUHK-X includes three benchmarks with six evaluation tasks. Experiments report average accuracies of 76.52% (HAR), 40.76% (HAU), and 70.25% (HARn). CUHK-X aims to enable the community to apply and develop data-intensive learning methods for robust, multimodal human activity analysis. Project page and code: https://openaiotlab.github.io/CUHK-X/ and https://github.com/openaiotlab/CUHK-X.
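To make the two ground-truth pair types concrete, below is a minimal Python sketch of how one CUHK-X-style sample could be represented, pairing the four sensor modalities with both a discrete label (for HAR) and a free-text caption (for HAU/HARn). The field names, array shapes, and example values are illustrative assumptions rather than the dataset's actual file layout; consult the project page for the released format and loaders.

```python
# Hypothetical in-memory representation of one multimodal sample.
# Shapes and dtypes are assumptions chosen only for illustration.
from dataclasses import dataclass
import numpy as np

@dataclass
class MultimodalSample:
    rgb: np.ndarray     # (T, H, W, 3) video frames
    depth: np.ndarray   # (T, H, W) depth maps
    imu: np.ndarray     # (T, 6) accelerometer + gyroscope readings
    mmwave: np.ndarray  # (T, N, 3) radar point clouds
    label: str          # coarse action category (data label, used for HAR)
    caption: str        # fine-grained description (data caption, used for HAU/HARn)

# Toy example: 30 frames of zeroed sensor data paired with both annotation types.
sample = MultimodalSample(
    rgb=np.zeros((30, 224, 224, 3), dtype=np.uint8),
    depth=np.zeros((30, 224, 224), dtype=np.float32),
    imu=np.zeros((30, 6), dtype=np.float32),
    mmwave=np.zeros((30, 64, 3), dtype=np.float32),
    label="sit down",
    caption="The person walks to the desk and slowly lowers themselves onto the chair.",
)
print(sample.label, "|", sample.caption)
```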
Problem

Research questions and friction points this paper is trying to address.

Most LVLMs struggle with non-RGB modalities (depth, IMU, mmWave) because large-scale data-caption resources are lacking
Existing HAR datasets provide only coarse data-label annotations, which miss the fine-grained action dynamics needed for HAU and HARn
Captions naively generated from labels often lack logical and spatiotemporal consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces CUHK-X, a multimodal dataset of 58,445 samples spanning RGB, depth, IMU, and mmWave
Uses prompt-based scene creation with LLMs, plus human validation, to produce consistent captions (see the sketch after this list)
Provides benchmarks for human action recognition, understanding, and reasoning
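As a rough illustration of the prompt-based scene creation step, the sketch below asks an LLM to weave a fixed set of action labels into a logically connected, temporally ordered activity sequence and keeps only human-approved outputs. The OpenAI client usage, the model name, the prompt wording, and the helper names (generate_scene, human_validate) are assumptions made for this example, not the authors' released pipeline.

```python
# Minimal sketch: LLM-driven scene creation followed by a human verification stub.
# Assumes the `openai` Python package (>= 1.0) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

ACTIONS = ["sit down", "pick up phone", "stand up", "walk to door"]  # example action labels

def generate_scene(actions: list[str]) -> str:
    """Ask the LLM for a logically coherent, temporally ordered caption sequence."""
    prompt = (
        "You are describing one person in an indoor scene. "
        f"Write {len(actions)} numbered sentences that connect these actions into a single "
        f"logically coherent, temporally ordered activity: {', '.join(actions)}. "
        "Do not introduce any action outside this list."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def human_validate(scene: str) -> bool:
    """Stand-in for the human verification step: an annotator accepts or rejects."""
    print(scene)
    return input("Accept this scene? [y/n] ").strip().lower() == "y"

if __name__ == "__main__":
    scene = generate_scene(ACTIONS)
    if human_validate(scene):
        print("Scene accepted; its captions can be paired with the recorded sensor data.")
```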
🔎 Similar Papers
No similar papers found.
Siyang Jiang
The Chinese University of Hong Kong
Foundation Models, Federated Learning, Few-Shot Learning, AIoT
Mu Yuan
The Chinese University of Hong Kong, Hong Kong
Xiang Ji
The Chinese University of Hong Kong, Hong Kong
Bufang Yang
The Chinese University of Hong Kong, Hong Kong
Zeyu Liu
University of Illinois Urbana-Champaign, United States
Lilin Xu
Columbia University, United States
Yang Li
The Chinese University of Hong Kong, Hong Kong
Yuting He
Foundation Medicine Inc.
Precision Medicine, Biomarker and CDx, Cancer Genomics, Machine Learning, Data Mining
Liran Dong
The Chinese University of Hong Kong, Hong Kong
Wenrui Lu
The Chinese University of Hong Kong, Hong Kong
Zhenyu Yan
The Chinese University of Hong Kong, Hong Kong
Xiaofan Jiang
Associate Professor of Electrical Engineering, Columbia University
Mobile and Embedded Systems, Artificial Intelligence of Things, Smart Health and Fitness, CPHS
Wei Gao
University of Pittsburgh, United States
Hongkai Chen
The Chinese University of Hong Kong, Hong Kong
Guoliang Xing
The Chinese University of Hong Kong
Embedded AI, AI for Health, Autonomous Driving, Cyber-Physical Systems, Wireless Networks