HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models

📅 2025-02-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) suffer from limited performance in human action video understanding due to the scarcity of high-quality annotated data. To address this, we propose a standardized video annotation paradigm tailored for fine-grained action understanding and introduce HAIC—the first open-source benchmark for human action understanding and generation. HAIC comprises HAICTrain, a training set of 126K high-quality video–text pairs, and HAICBench, an evaluation set containing 500 videos and 1,400 question-answer pairs. Our annotation methodology integrates human attribute modeling, temporal action structuring, Gemini-Pro–assisted labeling, and rigorous human verification, enabling both MLLM fine-tuning and comprehensive evaluation. Experiments demonstrate that models trained on HAIC achieve significant performance gains across four major human–action understanding benchmarks and notably improve text-to-video generation quality. The code and dataset are publicly released on Hugging Face.

📝 Abstract
Recent Multi-modal Large Language Models (MLLMs) have made great progress in video understanding. However, their performance on videos involving human actions is still limited by the lack of high-quality data. To address this, we introduce a two-stage data annotation pipeline. First, we design strategies to accumulate videos featuring clear human actions from the Internet. Second, videos are annotated in a standardized caption format that uses human attributes to distinguish individuals and chronologically details their actions and interactions. Through this pipeline, we curate two datasets, namely HAICTrain and HAICBench. HAICTrain comprises 126K video-caption pairs generated by Gemini-Pro and verified for training purposes. Meanwhile, HAICBench includes 500 manually annotated video-caption pairs and 1,400 QA pairs for a comprehensive evaluation of human action understanding. Experimental results demonstrate that training with HAICTrain not only significantly enhances human action understanding abilities across 4 benchmarks, but also improves text-to-video generation results. Both HAICTrain and HAICBench are released at https://huggingface.co/datasets/KuaishouHAIC/HAIC.

Problem

Research questions and friction points this paper is trying to address.

Improving human action understanding in videos
Enhancing text-to-video generation quality
Addressing lack of high-quality action data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage data annotation pipeline for videos
Standardized caption format using human attributes
Curated datasets HAICTrain and HAICBench
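The standardized caption format described above identifies each person by appearance attributes and lists that person's actions in chronological order. As a rough illustration only (the field names and rendering below are hypothetical, not the paper's actual schema), such an annotation might be rendered into a caption string like this:

```python
# Hypothetical sketch of a standardized human-action caption: each subject is
# distinguished by appearance attributes, and their actions are detailed
# chronologically. Field names ("attributes", "actions") are illustrative.

def build_caption(subjects):
    """Render a structured annotation into a single caption string."""
    parts = []
    for s in subjects:
        attrs = ", ".join(s["attributes"])      # e.g. "a man in a red jacket"
        actions = "; then ".join(s["actions"])  # actions in chronological order
        parts.append(f"{attrs}: {actions}.")
    return " ".join(parts)

record = [
    {"attributes": ["a man in a red jacket"],
     "actions": ["picks up a basketball",
                 "dribbles toward the hoop",
                 "makes a layup"]},
    {"attributes": ["a woman in a white shirt"],
     "actions": ["watches from the sideline", "applauds"]},
]

print(build_caption(record))
```

Distinguishing individuals by attributes before describing their actions is what lets a caption about multi-person videos stay unambiguous about who did what, and when.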