HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models

📅 2025-02-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) suffer from limited performance in human action video understanding due to the scarcity of high-quality annotated data. To address this, we propose a standardized video annotation paradigm tailored for fine-grained action understanding and introduce HAIC—the first open-source benchmark for human action understanding and generation. HAIC comprises HAICTrain, a training set of 126K high-quality video–text pairs, and HAICBench, an evaluation set containing 500 videos and 1,400 question-answer pairs. Our annotation methodology integrates human attribute modeling, temporal action structuring, Gemini-Pro–assisted labeling, and rigorous human verification, enabling both MLLM fine-tuning and comprehensive evaluation. Experiments demonstrate that models trained on HAIC achieve significant performance gains across four major human–action understanding benchmarks and notably improve text-to-video generation quality. The code and dataset are publicly released on Hugging Face.

📝 Abstract
Recent Multi-modal Large Language Models (MLLMs) have made great progress in video understanding. However, their performance on videos involving human actions is still limited by the lack of high-quality data. To address this, we introduce a two-stage data annotation pipeline. First, we design strategies to accumulate videos featuring clear human actions from the Internet. Second, videos are annotated in a standardized caption format that uses human attributes to distinguish individuals and chronologically details their actions and interactions. Through this pipeline, we curate two datasets, namely HAICTrain and HAICBench. HAICTrain comprises 126K video-caption pairs generated by Gemini-Pro and verified for training purposes. Meanwhile, HAICBench includes 500 manually annotated video-caption pairs and 1,400 QA pairs for a comprehensive evaluation of human action understanding. Experimental results demonstrate that training with HAICTrain not only significantly enhances human action understanding abilities across 4 benchmarks, but also improves text-to-video generation results. Both HAICTrain and HAICBench are released at https://huggingface.co/datasets/KuaishouHAIC/HAIC.

Problem

Research questions and friction points this paper is trying to address.

Improving human action understanding in videos
Enhancing text-to-video generation quality
Addressing lack of high-quality action data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage data annotation pipeline for videos
Standardized caption format using human attributes
Curated datasets HAICTrain and HAICBench
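The standardized caption format described above identifies each person by appearance attributes and lists that person's actions in chronological order. As a rough illustration only (the field names and rendering below are hypothetical, not the paper's actual schema), such an annotation might be rendered into a caption string like this:

```python
# Hypothetical sketch of a standardized human-action caption: each subject is
# distinguished by appearance attributes, and their actions are detailed
# chronologically. Field names ("attributes", "actions") are illustrative.

def build_caption(subjects):
    """Render a structured annotation into a single caption string."""
    parts = []
    for s in subjects:
        attrs = ", ".join(s["attributes"])      # e.g. "a man in a red jacket"
        actions = "; then ".join(s["actions"])  # actions in chronological order
        parts.append(f"{attrs}: {actions}.")
    return " ".join(parts)

record = [
    {"attributes": ["a man in a red jacket"],
     "actions": ["picks up a basketball",
                 "dribbles toward the hoop",
                 "makes a layup"]},
    {"attributes": ["a woman in a white shirt"],
     "actions": ["watches from the sideline", "applauds"]},
]

print(build_caption(record))
```

Distinguishing individuals by attributes before describing their actions is what lets a caption about multi-person videos stay unambiguous about who did what, and when.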