🤖 AI Summary
This study addresses a fundamental limitation of multimodal large language models (MLLMs) in understanding biological motion. We introduce ActPLD, the first MLLM-specific benchmark for point-light action recognition, grounded in minimal point-light displays (PLDs): sparse spatiotemporal sequences of human joint trajectories. ActPLD systematically evaluates MLLMs on two core tasks: single-person action recognition and social interaction understanding. Experiments across leading closed- and open-source MLLMs reveal critical deficiencies in temporal dynamic modeling and in embodied semantic parsing of body motion. Beyond establishing the first PLD-driven evaluation framework for MLLMs, this work introduces a paradigm for assessing action understanding that uses embodiment-constrained stimuli to probe spatiotemporal reasoning. ActPLD thus provides both a rigorous diagnostic tool and concrete guidance for advancing MLLMs' spatiotemporal reasoning and embodied intelligence.
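The summary does not spell out the evaluation protocol, but the two tasks lend themselves to a multiple-choice setup over PLD clips. The sketch below only illustrates that general shape; the names `build_action_prompt`, `query_mllm`, and `evaluate` are hypothetical, and `query_mllm` is a stub standing in for a real vision-language API call, not ActPLD's actual pipeline.

```python
# Minimal sketch of a multiple-choice evaluation loop for PLD action recognition.
# All names are illustrative assumptions; query_mllm is a stub for a real MLLM call.

def build_action_prompt(candidate_actions):
    """Format a single-person action-recognition query over a PLD clip."""
    options = "\n".join(f"({chr(65 + i)}) {a}" for i, a in enumerate(candidate_actions))
    return (
        "The video contains only moving dots placed on the joints of a human body "
        "(a point-light display). Which action is being performed?\n"
        f"{options}\nAnswer with a single letter."
    )

def query_mllm(frames, prompt):
    """Placeholder for an MLLM request that takes a frame sequence plus a text prompt."""
    return "A"  # stub response so the sketch runs end to end

def evaluate(clips, labels, candidate_actions):
    """Accuracy of letter-choice answers against ground-truth action labels."""
    correct = 0
    for frames, label in zip(clips, labels):
        answer = query_mllm(frames, build_action_prompt(candidate_actions)).strip()
        predicted = candidate_actions[ord(answer[0].upper()) - ord("A")]
        correct += predicted == label
    return correct / len(labels)
```

The same loop extends to the social-interaction task by swapping in two-actor clips and interaction labels.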
📝 Abstract
Humans can extract rich semantic information from minimal visual cues, as demonstrated by point-light displays (PLDs), which consist of sparse sets of dots localized to key joints of the human body. This ability emerges early in development and is largely attributed to human embodied experience. Since PLDs isolate body motion as the sole source of meaning, they are key stimuli for testing the limits of action understanding in multimodal large language models (MLLMs). Here we introduce ActPLD, the first benchmark to evaluate action processing in MLLMs from human PLDs. We test state-of-the-art proprietary and open-source systems on both single-actor and socially interacting PLDs. Our results reveal consistently low performance across models, exposing fundamental gaps in action and spatiotemporal understanding.
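As a concrete illustration of the stimulus format, a PLD can be stored as a (frames, joints, 2) array of joint coordinates and rendered as dot-only images that a model receives as a video. This is a minimal sketch under assumed conventions: the 13-joint figure, the canvas size, and the `render_pld_frames` helper are illustrative, not ActPLD's actual data pipeline.

```python
import numpy as np
from PIL import Image, ImageDraw

def render_pld_frames(joints_xy, canvas_size=256, dot_radius=3):
    """Render a point-light display as a sequence of dot-only images.

    joints_xy: array of shape (T, J, 2) with normalized (0..1) x, y coordinates
    of J body joints over T frames. Each frame becomes a black canvas with one
    white dot per joint, so body motion is the only visual cue available.
    """
    frames = []
    for frame in joints_xy:
        img = Image.new("RGB", (canvas_size, canvas_size), "black")
        draw = ImageDraw.Draw(img)
        for x, y in frame:
            cx, cy = x * canvas_size, y * canvas_size
            draw.ellipse(
                [cx - dot_radius, cy - dot_radius, cx + dot_radius, cy + dot_radius],
                fill="white",
            )
        frames.append(img)
    return frames

# Toy example: a 13-joint figure drifting rightward over 16 frames.
rng = np.random.default_rng(0)
base_pose = rng.uniform(0.3, 0.7, size=(13, 2))
trajectory = np.stack([base_pose + [t * 0.01, 0.0] for t in range(16)])
pld_frames = render_pld_frames(trajectory)  # pass these frames to an MLLM as a video
```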