Encoder-Free Human Motion Understanding via Structured Motion Descriptions

📅 2026-04-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing large language model (LLM)-based approaches to human motion understanding: they rely on specialized encoders for cross-modal alignment, which constrains deep semantic reasoning. The authors propose Structured Motion Description (SMD), a framework that translates motion sequences into human-readable natural language text using deterministic, biomechanics-inspired rules that describe joint angles, body-part movements, and global trajectory. This lets LLMs comprehend and reason about motion semantics directly, without additional encoders. By avoiding the conventional cross-modal alignment paradigm, SMD supports plug-and-play integration with different LLMs and interpretable attention analysis. Experiments show that SMD achieves state-of-the-art performance, with accuracies of 66.7% on BABEL-QA and 90.1% on HuMMan-QA, and scores of R@1 = 0.584 and CIDEr = 53.16 on HumanML3D.

📝 Abstract

The world knowledge and reasoning capabilities of text-based large language models (LLMs) are advancing rapidly, yet current approaches to human motion understanding, including motion question answering and captioning, have not fully exploited these capabilities. Existing LLM-based methods typically learn motion-language alignment through dedicated encoders that project motion features into the LLM's embedding space, and so remain constrained by cross-modal representation and alignment. Inspired by biomechanical analysis, where joint angles and body-part kinematics have long served as a precise descriptive language for human movement, we propose Structured Motion Description (SMD), a rule-based, deterministic approach that converts joint position sequences into structured natural language descriptions of joint angles, body-part movements, and global trajectory. By representing motion as text, SMD enables LLMs to apply their pretrained knowledge of body parts, spatial directions, and movement semantics directly to motion reasoning, without requiring learned encoders or alignment modules. We show that this approach surpasses prior state-of-the-art results on both motion question answering (66.7% on BABEL-QA, 90.1% on HuMMan-QA) and motion captioning (R@1 of 0.584, CIDEr of 53.16 on HumanML3D). SMD additionally offers practical benefits: the same text input works across different LLMs with only lightweight LoRA adaptation (validated on 8 LLMs from 6 model families), and its human-readable representation enables interpretable attention analysis over motion descriptions. Code, data, and pretrained LoRA adapters are available at https://yaozhang182.github.io/motion-smd/.
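
To make the idea of a rule-based, deterministic motion-to-text conversion concrete, the following is a minimal sketch and not the authors' implementation. It computes one joint angle from three 3D joint positions, maps it to a phrase, and summarizes the root displacement as a trajectory sentence. The function names, the angle thresholds, the output wording, and the axis convention (y up, +z forward, +x right, units in meters) are all assumptions made for illustration; the actual SMD rules and text format are defined in the paper and its released code.

```python
import math

def joint_angle_deg(a, b, c):
    """Angle at joint b, in degrees, formed by the 3D points a-b-c."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.sqrt(sum(x * x for x in v1)) * math.sqrt(sum(x * x for x in v2))
    if norm == 0.0:
        return 0.0  # degenerate case: two joints coincide
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cos))

def describe_angle(name, angle):
    """Map an angle to a phrase. Thresholds are hypothetical, not from the paper."""
    if angle > 150:
        state = "nearly straight"
    elif angle > 90:
        state = "slightly bent"
    else:
        state = "deeply bent"
    return f"{name}: {state} ({angle:.0f} deg)"

def describe_trajectory(root_positions):
    """Summarize horizontal root displacement (assumed: y up, +z forward, +x right)."""
    dx = root_positions[-1][0] - root_positions[0][0]
    dz = root_positions[-1][2] - root_positions[0][2]
    dist = math.hypot(dx, dz)
    if dist < 0.1:  # hypothetical threshold for "no net movement"
        return "Global trajectory: stays in place"
    if abs(dz) >= abs(dx):
        direction = "forward" if dz > 0 else "backward"
    else:
        direction = "right" if dx > 0 else "left"
    return f"Global trajectory: moves {direction} {dist:.1f} m"

# Example: a straight left leg, and a root that moves one meter along +z.
hip, knee, ankle = (0.0, 1.0, 0.0), (0.0, 0.5, 0.0), (0.0, 0.0, 0.0)
knee_line = describe_angle("left knee", joint_angle_deg(hip, knee, ankle))
traj_line = describe_trajectory([(0.0, 0.9, 0.0), (0.0, 0.9, 1.0)])
print(knee_line)
print(traj_line)
```

Because every step is a fixed rule, the same joint positions always yield the same text, which is what allows the description to be passed unchanged to different LLMs as ordinary prompt input.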

Problem

Research questions and friction points this paper is trying to address.

human motion understanding

motion-language alignment

large language models

motion question answering

motion captioning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Structured Motion Description

Encoder-Free

Motion Understanding