Action100M: A Large-scale Video Action Dataset

📅 2026-01-15
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the scarcity of large-scale, open-vocabulary datasets for video action understanding. The authors propose a fully automated pipeline that constructs a massive action dataset from 1.2 million online instructional videos (14.6 years of content), yielding approximately 100 million temporally localized action segments, each paired with rich textual descriptions. The pipeline integrates hierarchical temporal segmentation, a Tree-of-Captions multi-granularity captioning structure, and a multi-round self-refinement inference mechanism, using V-JEPA 2 embeddings for segmentation and GPT-OSS-120B for structured caption refinement. A VL-JEPA model trained on this dataset performs strongly across multiple action recognition benchmarks, demonstrating robust zero-shot transfer and consistent gains as training data scales.
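
The segmentation stage is described only at a high level. As a rough illustration, the sketch below shows one plausible way to cut a video into coarse-to-fine segments by thresholding cosine similarity between consecutive frame embeddings. The function names, threshold values, and the greedy splitting rule are assumptions for illustration, not the paper's algorithm; only the use of V-JEPA 2 frame embeddings comes from the source.

```python
# Hypothetical sketch of hierarchical temporal segmentation over
# precomputed per-frame embeddings (e.g., from V-JEPA 2).
# All names and thresholds are illustrative assumptions.
import numpy as np

def segment(embeddings: np.ndarray, threshold: float) -> list[tuple[int, int]]:
    """Greedily split a video into segments wherever the cosine similarity
    between consecutive frame embeddings drops below `threshold`."""
    # Normalize rows so dot products equal cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = np.sum(normed[:-1] * normed[1:], axis=1)

    segments, start = [], 0
    for i, sim in enumerate(sims):
        if sim < threshold:  # treat a similarity drop as a boundary
            segments.append((start, i + 1))
            start = i + 1
    segments.append((start, len(embeddings)))
    return segments

def hierarchical_segments(embeddings: np.ndarray,
                          thresholds=(0.5, 0.7, 0.9)) -> dict[float, list]:
    """Run the splitter at progressively stricter thresholds to obtain
    coarse-to-fine segment levels (a segmentation hierarchy)."""
    return {t: segment(embeddings, t) for t in thresholds}

# Example: 300 frames of 1024-d embeddings.
emb = np.random.randn(300, 1024)
levels = hierarchical_segments(emb)
```

Stricter thresholds produce finer segments, so the resulting levels nest naturally into a hierarchy over which multi-level captions (the Tree-of-Captions) can be organized.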

📝 Abstract
Inferring physical actions from visual observations is a fundamental capability for advancing machine intelligence in the physical world. Achieving this requires large-scale, open-vocabulary video action datasets that span broad domains. We introduce Action100M, a large-scale dataset constructed from 1.2M Internet instructional videos (14.6 years of duration), yielding O(100 million) temporally localized segments with open-vocabulary action supervision and rich captions. Action100M is generated by a fully automated pipeline that (i) performs hierarchical temporal segmentation using V-JEPA 2 embeddings, (ii) produces multi-level frame and segment captions organized as a Tree-of-Captions, and (iii) aggregates evidence with a reasoning model (GPT-OSS-120B) under a multi-round Self-Refine procedure to output structured annotations (brief/detailed action, actor, brief/detailed caption). Training VL-JEPA on Action100M demonstrates consistent data-scaling improvements and strong zero-shot performance across diverse action recognition benchmarks, establishing Action100M as a new foundation for scalable research in video understanding and world modeling.
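
The abstract names the exact annotation fields (brief/detailed action, actor, brief/detailed caption) and a multi-round Self-Refine procedure for producing them. The sketch below is a minimal rendering of that output schema as a dataclass plus a generic draft-critique-revise loop; the `generate` callable, prompt wording, and round count are assumptions standing in for calls to the reasoning model (GPT-OSS-120B in the paper).

```python
# Hypothetical sketch of the structured annotation schema and the
# multi-round Self-Refine loop described in the abstract. The prompts
# and refinement criteria are assumptions, not the paper's exact setup.
from dataclasses import dataclass
from typing import Callable

@dataclass
class ActionAnnotation:
    """Structured output fields listed in the abstract."""
    brief_action: str
    detailed_action: str
    actor: str
    brief_caption: str
    detailed_caption: str

def self_refine(evidence: str,
                generate: Callable[[str], str],
                rounds: int = 3) -> str:
    """Draft an annotation from aggregated caption evidence, then
    alternately critique and revise it for a fixed number of rounds."""
    draft = generate(f"Annotate this segment:\n{evidence}")
    for _ in range(rounds):
        feedback = generate(f"Critique this annotation:\n{draft}")
        draft = generate(
            f"Revise the annotation.\nAnnotation:\n{draft}\nFeedback:\n{feedback}"
        )
    return draft
```

In practice `evidence` would be the frame and segment captions aggregated from the Tree-of-Captions for a given segment, and the final draft would be parsed into an `ActionAnnotation`.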
Problem

Research questions and friction points this paper is trying to address.

video action dataset
open-vocabulary
large-scale
action recognition
machine intelligence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Action100M
hierarchical temporal segmentation
Tree-of-Captions
Self-Refine reasoning
open-vocabulary action dataset