Explainable Action Form Assessment by Exploiting Multimodal Chain-of-Thoughts Reasoning

📅 2025-12-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video understanding methods struggle to assess the normative correctness of human actions and provide interpretable, actionable improvement suggestions, further hindered by the absence of benchmark datasets featuring fine-grained quality annotations and reasoning-based feedback. To address this, we introduce Action Form Assessment (AFA), a novel task for evaluating movement correctness in fitness and martial arts contexts. We present CoT-AFA—the first dataset supporting hierarchical quality annotations and chain-of-thought (CoT)–driven explanatory feedback. Methodologically, we propose a dual-stream vision–language parallel encoder coupled with a dynamic gating fusion mechanism, jointly enabling multimodal chain-of-thought reasoning and fine-grained quality modeling. Experiments demonstrate significant improvements over baselines: +16.0% CIDEr for explanation generation, +2.7% accuracy for action classification, and +2.1% accuracy for quality assessment. The code and dataset are publicly released to advance research in interpretable behavioral understanding.

📝 Abstract
Evaluating whether a human action is performed in standard form and providing reasonable feedback to improve it is crucial but challenging in real-world scenarios. Current video understanding methods are mainly concerned with what an action is and where it occurs, which cannot meet these requirements. Meanwhile, most existing datasets lack labels indicating the degree of action standardization, and existing action quality assessment datasets lack explainability and detailed feedback. Therefore, we define a new Human Action Form Assessment (AFA) task and introduce a new diverse dataset, CoT-AFA, which contains a large-scale collection of fitness and martial arts videos with multi-level annotations for comprehensive video analysis. We enrich the CoT-AFA dataset with a novel Chain-of-Thought explanation paradigm. Instead of offering isolated feedback, our explanations provide a complete reasoning process, from identifying an action step to analyzing its outcome and proposing a concrete solution. Furthermore, we propose a framework named Explainable Fitness Assessor, which can not only judge an action but also explain why and provide a solution. This framework employs two parallel processing streams and a dynamic gating mechanism to fuse visual and semantic information, thereby boosting its analytical capabilities. The experimental results demonstrate that our method achieves improvements in explanation generation (e.g., +16.0% in CIDEr), action classification (+2.7% in accuracy) and quality assessment (+2.1% in accuracy), revealing the great potential of CoT-AFA for future studies. Our dataset and source code are available at https://github.com/MICLAB-BUPT/EFA.
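The abstract describes fusing the two parallel streams with a dynamic gating mechanism. The paper does not give the exact formulation here, but a common realization is a learned sigmoid gate over the concatenated features that weighs each dimension of the visual and semantic streams. Below is a minimal NumPy sketch of that idea; the dimension `d`, the single-linear-layer gate, and all variable names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d = 8  # shared embedding dimension (assumed for illustration)

# Hypothetical features produced by the two parallel streams.
visual = rng.standard_normal(d)
semantic = rng.standard_normal(d)

# Gating network: one linear layer over the concatenated features
# (a stand-in for whatever gate the paper actually learns).
W = rng.standard_normal((d, 2 * d)) * 0.1
b = np.zeros(d)

gate = sigmoid(W @ np.concatenate([visual, semantic]) + b)  # values in (0, 1)
fused = gate * visual + (1.0 - gate) * semantic  # per-dimension convex mix

print(fused.shape)
```

The gate lets the model decide, per dimension and per input, how much to trust the visual stream versus the semantic stream, rather than using a fixed concatenation or average.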
Problem

Research questions and friction points this paper is trying to address.

Assess human action standardization and provide feedback
Address lack of explainability in action quality assessment datasets
Propose multimodal reasoning for action analysis and solutions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Chain-of-Thought explanation paradigm for action assessment
Proposes dual-stream framework with dynamic gating for multimodal fusion
Creates CoT-AFA dataset with multi-level annotations for explainable evaluation
Mengshi Qi
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, China
Yeteng Wu
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, China
Xianlin Zhang
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, China
Huadong Ma
BUPT
Internet of Things · Multimedia