🤖 AI Summary
Current large vision-language models (LVLMs) offer little support for human action understanding (HAU) and reasoning (HARn), largely because large-scale non-RGB multimodal data (e.g., depth, IMU, millimeter-wave) and fine-grained action annotations are scarce. To address this gap, we introduce CUHK-X, the first large-scale multimodal dataset for HAU and HARn, comprising 58,445 samples spanning RGB, depth, IMU, and millimeter-wave modalities. We propose a prompt-driven scene generation framework, augmented with human verification, that produces logically coherent and temporally consistent action description sequences. We further establish benchmarks covering classification, descriptive generation, and causal reasoning. State-of-the-art models achieve average accuracies of 76.52%, 40.76%, and 70.25% on the recognition, understanding, and reasoning tasks, respectively, providing baselines that narrow the data and evaluation gaps in fine-grained non-RGB action analysis.
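For intuition, here is a minimal sketch of what such a prompt-driven scene generation loop could look like. The `call_llm` helper, prompt wording, and action vocabulary below are illustrative assumptions, not the paper's released implementation:

```python
from typing import Callable, List

# Illustrative subset of an action vocabulary; the actual dataset defines 40 actions.
ACTIONS = ["walk to chair", "sit down", "pick up cup", "drink water", "stand up"]

SCENE_PROMPT = (
    "You are scripting a realistic indoor scene. Using ONLY actions from this "
    "list: {vocab}. Write a sequence of {n} actions that is logically coherent "
    "and temporally consistent (e.g., 'sit down' must precede 'stand up'). "
    "Return exactly one action per line, nothing else."
)

def generate_scene(call_llm: Callable[[str], str], n: int = 5) -> List[str]:
    """Ask an LLM for a coherent action sequence, then run an automatic check.

    Sequences that pass the vocabulary check would still go to human
    verification, mirroring the paper's manual validation step.
    """
    prompt = SCENE_PROMPT.format(vocab=", ".join(ACTIONS), n=n)
    reply = call_llm(prompt)
    steps = [line.strip().lower() for line in reply.splitlines() if line.strip()]
    # Reject sequences with the wrong length or out-of-vocabulary actions.
    if len(steps) != n or any(s not in ACTIONS for s in steps):
        raise ValueError("LLM output failed the vocabulary check; regenerate")
    return steps
```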
📝 Abstract
Multimodal human action recognition (HAR) leverages complementary sensors for activity classification. Beyond recognition, recent advances in large language models (LLMs) enable detailed descriptions and causal reasoning, motivating two new tasks: human action understanding (HAU) and human action reasoning (HARn). However, most LLMs, especially large vision-language models (LVLMs), struggle with non-RGB modalities such as depth, IMU, and mmWave due to the lack of large-scale data-caption resources. Existing HAR datasets mainly provide coarse data-label annotations, which are insufficient to capture the fine-grained action dynamics needed for HAU and HARn. We consider two ground-truth pair types: (1) data-label (a discrete category) and (2) data-caption (a textual description). Captions naively generated from labels often lack logical and spatiotemporal consistency. We introduce CUHK-X, a large-scale multimodal dataset and benchmark suite for HAR, HAU, and HARn. CUHK-X contains 58,445 samples covering 40 actions performed by 30 participants in two indoor environments. To improve caption consistency, we propose a prompt-based scene creation method that uses LLMs to generate logically connected activity sequences, followed by human validation. CUHK-X includes three benchmarks with six evaluation tasks. Experiments report average accuracies of 76.52% (HAR), 40.76% (HAU), and 70.25% (HARn). CUHK-X aims to enable the community to apply and develop data-intensive learning methods for robust, multimodal human activity analysis. Project page and code: https://openaiotlab.github.io/CUHK-X/ and https://github.com/openaiotlab/CUHK-X.
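To make the two ground-truth pair types concrete, one plausible per-sample layout is sketched below. The field names and array shapes are assumptions for illustration and may not match the released schema:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class CUHKXSample:
    """One multimodal sample carrying both ground-truth pair types."""
    rgb: np.ndarray     # (T, H, W, 3) video frames
    depth: np.ndarray   # (T, H, W) depth maps
    imu: np.ndarray     # (T, 6) accelerometer + gyroscope readings
    mmwave: np.ndarray  # (T, P, 4) radar points: x, y, z, Doppler
    label: str          # pair type 1, data-label: discrete action category
    caption: str        # pair type 2, data-caption: fine-grained description
```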