🤖 AI Summary
This study addresses a fundamental limitation of multimodal large language models (MLLMs) in understanding biological motion. We introduce ActPLD, the first MLLM-specific benchmark for point-light action recognition, grounded in minimal point-light displays (PLDs): sparse spatiotemporal sequences of human joint trajectories. ActPLD systematically evaluates MLLMs on two core tasks: single-person action recognition and social interaction understanding. Experiments across leading closed- and open-source MLLMs reveal critical deficiencies in temporal dynamic modeling and in embodied semantic parsing of body motion. Beyond establishing the first PLD-driven evaluation framework for MLLMs, this work introduces a paradigm for assessing action understanding that uses embodiment-constrained stimuli to probe spatiotemporal reasoning. ActPLD thus provides both a rigorous diagnostic tool and concrete guidance for advancing MLLMs' spatiotemporal reasoning and embodied intelligence.
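The summary does not spell out the evaluation protocol, but the two tasks lend themselves to a multiple-choice setup over PLD clips. The sketch below only illustrates that general shape; the names `build_action_prompt`, `query_mllm`, and `evaluate` are hypothetical, and `query_mllm` is a stub standing in for a real vision-language API call, not ActPLD's actual pipeline.

```python
# Minimal sketch of a multiple-choice evaluation loop for PLD action recognition.
# All names are illustrative assumptions; query_mllm is a stub for a real MLLM call.

def build_action_prompt(candidate_actions):
    """Format a single-person action-recognition query over a PLD clip."""
    options = "\n".join(f"({chr(65 + i)}) {a}" for i, a in enumerate(candidate_actions))
    return (
        "The video contains only moving dots placed on the joints of a human body "
        "(a point-light display). Which action is being performed?\n"
        f"{options}\nAnswer with a single letter."
    )

def query_mllm(frames, prompt):
    """Placeholder for an MLLM request that takes a frame sequence plus a text prompt."""
    return "A"  # stub response so the sketch runs end to end

def evaluate(clips, labels, candidate_actions):
    """Accuracy of letter-choice answers against ground-truth action labels."""
    correct = 0
    for frames, label in zip(clips, labels):
        answer = query_mllm(frames, build_action_prompt(candidate_actions)).strip()
        predicted = candidate_actions[ord(answer[0].upper()) - ord("A")]
        correct += predicted == label
    return correct / len(labels)
```

The same loop extends to the social-interaction task by swapping in two-actor clips and interaction labels.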
📝 Abstract
Humans can extract rich semantic information from minimal visual cues, as demonstrated by point-light displays (PLDs), which consist of sparse sets of dots localized to key joints of the human body. This ability emerges early in development and is largely attributed to human embodied experience. Since PLDs isolate body motion as the sole source of meaning, they are key stimuli for testing the limits of action understanding in multimodal large language models (MLLMs). Here we introduce ActPLD, the first benchmark to evaluate action processing in MLLMs from human PLDs. We test state-of-the-art proprietary and open-source systems on both single-actor and socially interacting PLDs. Our results reveal consistently low performance across models, exposing fundamental gaps in action and spatiotemporal understanding.
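As a concrete illustration of the stimulus format, a PLD can be stored as a (frames, joints, 2) array of joint coordinates and rendered as dot-only images that a model receives as a video. This is a minimal sketch under assumed conventions: the 13-joint figure, the canvas size, and the `render_pld_frames` helper are illustrative, not ActPLD's actual data pipeline.

```python
import numpy as np
from PIL import Image, ImageDraw

def render_pld_frames(joints_xy, canvas_size=256, dot_radius=3):
    """Render a point-light display as a sequence of dot-only images.

    joints_xy: array of shape (T, J, 2) with normalized (0..1) x, y coordinates
    of J body joints over T frames. Each frame becomes a black canvas with one
    white dot per joint, so body motion is the only visual cue available.
    """
    frames = []
    for frame in joints_xy:
        img = Image.new("RGB", (canvas_size, canvas_size), "black")
        draw = ImageDraw.Draw(img)
        for x, y in frame:
            cx, cy = x * canvas_size, y * canvas_size
            draw.ellipse(
                [cx - dot_radius, cy - dot_radius, cx + dot_radius, cy + dot_radius],
                fill="white",
            )
        frames.append(img)
    return frames

# Toy example: a 13-joint figure drifting rightward over 16 frames.
rng = np.random.default_rng(0)
base_pose = rng.uniform(0.3, 0.7, size=(13, 2))
trajectory = np.stack([base_pose + [t * 0.01, 0.0] for t in range(16)])
pld_frames = render_pld_frames(trajectory)  # pass these frames to an MLLM as a video
```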