🤖 AI Summary
This study presents the first systematic evaluation of mainstream vision-language models (Gemini 3.1 Pro, Qwen3-VL, and InternVL3.5) on action quality assessment (AQA), a task with applications in physical therapy, sports coaching, and competitive judging. Through empirical analysis across diverse activity types, visual representations (e.g., skeletal data), prompting strategies (including grounding instructions and in-context learning), and task formulations, the work shows that current models perform only marginally above random chance. The findings highlight pervasive systematic biases, notably a tendency to predict correct execution regardless of fine-grained visual evidence and a susceptibility to superficial linguistic framing. Moreover, conventional prompt engineering proves largely ineffective at mitigating these limitations. The research underscores fundamental shortcomings in existing vision-language models' capacity for fine-grained action understanding and establishes a rigorous baseline to guide future advances in the field.
📝 Abstract
Action Quality Assessment (AQA) has broad applications in physical therapy, sports coaching, and competitive judging. Although Vision Language Models (VLMs) hold considerable promise for AQA, their actual performance in this domain remains largely uncharacterised. We present a comprehensive evaluation of state-of-the-art VLMs across activity domains (e.g., fitness, figure skating, diving), tasks, representations, and prompting strategies. Baseline results reveal that Gemini 3.1 Pro, Qwen3-VL, and InternVL3.5 models perform only marginally above random chance, and although strategies such as incorporating skeleton information, grounding instructions, reasoning structures, and in-context learning lead to isolated gains, none is consistently effective. Analysis of prediction distributions uncovers two systematic biases: a tendency to predict correct execution regardless of visual evidence, and a sensitivity to superficial linguistic framing. Reformulating tasks contrastively to mitigate these biases yields minimal improvement, suggesting that the models' limitations run deeper and point to a fundamental difficulty with fine-grained movement quality assessment. Our findings establish a rigorous baseline for future VLM-based AQA research and provide an actionable outline of the failure modes that must be mitigated before reliable real-world deployment.