🤖 AI Summary
This work addresses the absence of dedicated evaluation benchmarks for fine-grained micro-action understanding in current multimodal large language models (MLLMs), which hinders the assessment of their perceptual and reasoning capabilities regarding subtle human behaviors. To this end, we introduce MA-Bench—the first fine-grained benchmark specifically designed for micro-action understanding—comprising 1,000 videos and 12,000 structured question-answer pairs, along with a complementary training set, MA-Bench-Train, containing 20.5K videos. A three-tiered evaluation framework systematically measures model performance across micro-action perception, relational understanding, and explanatory reasoning. Experiments on 23 prominent MLLMs reveal significant limitations in modeling fine-grained action dynamics and body-part interactions. Notably, fine-tuning Qwen3-VL-8B on MA-Bench-Train substantially improves performance on reasoning and explanation tasks, demonstrating the benchmark’s effectiveness and practical utility.
📝 Abstract
With the rapid development of Multimodal Large Language Models (MLLMs), their potential for Micro-Action understanding, which plays a vital role in human emotion analysis, remains unexplored due to the absence of specialized benchmarks. To tackle this issue, we present MA-Bench, a benchmark comprising 1,000 videos and a three-tier evaluation architecture that progressively examines micro-action perception, relational comprehension, and interpretive reasoning. MA-Bench contains 12,000 structured question-answer pairs, enabling systematic assessment of both recognition accuracy and action interpretation. Results from 23 representative MLLMs reveal significant challenges in capturing motion granularity and fine-grained body-part dynamics. To address these challenges, we further construct MA-Bench-Train, a large-scale training corpus of 20.5K videos annotated with structured micro-action captions for fine-tuning MLLMs. Qwen3-VL-8B fine-tuned on MA-Bench-Train shows clear performance improvements across micro-action reasoning and explanation tasks. Our work aims to establish a foundational benchmark for advancing MLLMs in understanding subtle micro-actions and human-related behaviors. Project Page: https://MA-Bench.github.io