Evaluating point-light biological motion in multimodal large language models

📅 2025-09-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses a fundamental limitation of multimodal large language models (MLLMs) in understanding biological motion. We introduce ActPLD, the first MLLM-specific benchmark for point-light action recognition, grounded in minimalistic point-light displays (PLDs): sparse spatiotemporal sequences of human joint trajectories. ActPLD systematically evaluates MLLMs on two core tasks: single-person action recognition and social interaction understanding. Experiments across leading closed- and open-source MLLMs reveal critical deficiencies in temporal dynamic modeling and in the embodied semantic parsing of body motion. Beyond establishing the first PLD-driven evaluation framework for MLLMs, this work introduces a paradigm for assessing action understanding that uses embodiment-constrained stimuli to probe spatiotemporal reasoning. ActPLD thus provides both a rigorous diagnostic tool and concrete guidance for advancing MLLMs' spatiotemporal reasoning and embodied intelligence.

📝 Abstract
Humans can extract rich semantic information from minimal visual cues, as demonstrated by point-light displays (PLDs), which consist of sparse sets of dots localized to key joints of the human body. This ability emerges early in development and is largely attributed to embodied human experience. Because PLDs isolate body motion as the sole source of meaning, they are key stimuli for testing the limits of action understanding in multimodal large language models (MLLMs). Here we introduce ActPLD, the first benchmark to evaluate action processing in MLLMs using human PLDs. Tested models include state-of-the-art proprietary and open-source systems, evaluated on single-actor and socially interacting PLDs. Our results reveal consistently low performance across models, exposing fundamental gaps in action and spatiotemporal understanding.
Problem

Research questions and friction points this paper is trying to address.

Evaluating action understanding in multimodal language models
Testing biological motion perception using point-light displays
Identifying gaps in spatiotemporal reasoning capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluating action understanding via point-light displays
Testing multimodal models on sparse motion stimuli
Benchmarking biological motion perception in MLLMs
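To make the stimulus concrete: a point-light display reduces a human action to the trajectories of a few key joints, rendered as dots on a blank background. The paper does not publish code here, so the following is only a minimal NumPy sketch of how such frames could be rasterized; the helper name `render_pld_frames` and the `(T, J, 2)` joint-coordinate layout are illustrative assumptions, not the authors' pipeline.

```python
import numpy as np

def render_pld_frames(joints, size=64, radius=1):
    """Rasterize joint trajectories into binary point-light frames.

    joints: array of shape (T, J, 2), each row an (x, y) joint
            position normalized to [0, 1).
    Returns an array of shape (T, size, size), with 1 where a dot
    is drawn and 0 elsewhere (the blank background).
    """
    T, J, _ = joints.shape
    frames = np.zeros((T, size, size), dtype=np.uint8)
    for t in range(T):
        for x, y in joints[t]:
            # Center of the dot in pixel coordinates.
            cx, cy = int(x * size), int(y * size)
            # Draw a small square dot, clipped to the frame bounds.
            y0, y1 = max(cy - radius, 0), min(cy + radius + 1, size)
            x0, x1 = max(cx - radius, 0), min(cx + radius + 1, size)
            frames[t, y0:y1, x0:x1] = 1
    return frames
```

Frames like these, stacked over time, carry no texture, color, or shape cues: motion of the dots is the only signal, which is what makes PLDs a clean probe of spatiotemporal reasoning.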
Akila Kadambi
Psychiatry and Biobehavioral Sciences, UCLA
Marco Iacoboni
Psychiatry and Biobehavioral Sciences, UCLA
Lisa Aziz-Zadeh
Brain and Creativity Institute, USC
Srini Narayanan
Google DeepMind
Artificial Intelligence · Computational Neuroscience · Cognitive Science