ExAct: A Video-Language Benchmark for Expert Action Analysis

📅 2025-06-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the significant performance gap between current video-language models (VLMs) and human experts in domain-specific action understanding—e.g., sports, maintenance, and cooking. To this end, we introduce ExAct, the first benchmark explicitly designed for expert-level skill comprehension, covering six domains and eleven physical activity categories, with 3,521 expert-annotated video-question-answer pairs. We formally define and quantify fine-grained action semantic understanding as a discriminative multiple-choice task, employing a five-option format to increase cognitive and perceptual challenge. Experimental results reveal that the state-of-the-art VLM GPT-4o achieves only 44.70% zero-shot accuracy—substantially below the human expert baseline of 82.02%—highlighting fundamental limitations in domain knowledge integration and spatiotemporal action reasoning. ExAct supports both zero-shot and fine-tuning evaluation protocols; the dataset and code are publicly released to advance embodied intelligence and professional skill modeling.
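The evaluation protocol described above is a discriminative five-option multiple-choice task scored by accuracy. A minimal sketch of such a scorer is below; the record format and field names (`question`, `options`, `answer`) are hypothetical, not the released dataset's actual schema.

```python
# Minimal sketch of an ExAct-style evaluation loop: five-option
# multiple-choice accuracy over video QA pairs. The record format
# here is hypothetical -- the released dataset may differ.

def accuracy(records, predict):
    """records: dicts with 'question', 'options' (5 strings), and
    'answer' (index of the correct option); predict: a function
    mapping (question, options) -> predicted index."""
    assert all(len(r["options"]) == 5 for r in records)
    correct = sum(
        predict(r["question"], r["options"]) == r["answer"]
        for r in records
    )
    return correct / len(records)

# Toy usage with a trivial always-pick-the-first-option baseline.
records = [
    {"question": "Which cue best improves the swing?",
     "options": ["A", "B", "C", "D", "E"], "answer": 0},
    {"question": "What error does the cyclist make?",
     "options": ["A", "B", "C", "D", "E"], "answer": 2},
]
baseline = lambda q, opts: 0
print(accuracy(records, baseline))  # 0.5 on this toy set
```

Note that random guessing on a five-option format yields 20% expected accuracy, so GPT-4o's reported 44.70% sits well above chance but far below the 82.02% human expert baseline.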

📝 Abstract
We present ExAct, a new video-language benchmark for expert-level understanding of skilled physical human activities. Our new benchmark contains 3,521 expert-curated video question-answer pairs spanning 11 physical activities in 6 domains: Sports, Bike Repair, Cooking, Health, Music, and Dance. ExAct requires the correct answer to be selected from five carefully designed candidate options, thus necessitating a nuanced, fine-grained, expert-level understanding of physical human skills. Evaluating the recent state-of-the-art VLMs on ExAct reveals a substantial performance gap relative to human expert performance. Specifically, the best-performing GPT-4o model achieves only 44.70% accuracy, well below the 82.02% attained by trained human specialists/experts. We believe that ExAct will be beneficial for developing and evaluating VLMs capable of precise understanding of human skills in various physical and procedural domains. Dataset and code are available at https://texaser.github.io/exact_project_page/.
Problem

Research questions and friction points this paper is trying to address.

Benchmark for expert-level video-language understanding of physical skills
Evaluates VLMs on nuanced comprehension of human procedural actions
Identifies performance gap between AI and human experts in skill analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Expert-curated video QA pairs
Fine-grained expert-level understanding
Evaluates VLMs on human skills
Han Yi
University of North Carolina at Chapel Hill

Yulu Pan
PhD Student, University of North Carolina at Chapel Hill
Machine Learning · Computer Vision · AI for Sports

Feihong He
University of North Carolina at Chapel Hill

Xinyu Liu
University of North Carolina at Chapel Hill

Benjamin Zhang
University of North Carolina at Chapel Hill

Oluwatumininu Oguntola
University of North Carolina at Chapel Hill

Gedas Bertasius
Assistant Professor, University of North Carolina at Chapel Hill
Computer Vision · Machine Learning