🤖 AI Summary
This work addresses motion information distortion in human motion instruction tuning caused by linguistic tokenization of motion data. We propose LLaMo, a multimodal framework that natively encodes raw human motion sequences—such as 3D joint trajectories—bypassing symbolic tokenization, and jointly trains on video, motion, and text via end-to-end instruction tuning. Its core contribution is the first elimination of language-based motion representation within the instruction-tuning paradigm, achieved through a multimodal fusion architecture, a native motion encoder, and cross-modal alignment training for unified video–motion–language understanding. Experiments demonstrate significant performance gains on challenging tasks including fine-grained behavior recognition and domain-specific action parsing. The code and models are publicly released to support real-world applications in sports analytics, human–robot interaction, and embodied AI.
📝 Abstract
This paper presents LLaMo (Large Language and Human Motion Assistant), a multimodal framework for human motion instruction tuning. In contrast to conventional instruction-tuning approaches that convert non-linguistic inputs, such as video or motion sequences, into language tokens, LLaMo retains motion in its native form for instruction tuning. This method preserves motion-specific details that are often diminished in tokenization, thereby improving the model's ability to interpret complex human behaviors. By processing both video and motion data alongside textual inputs, LLaMo enables a flexible, human-centric analysis. Experimental evaluations across high-complexity domains, including human behaviors and professional activities, indicate that LLaMo effectively captures domain-specific knowledge, enhancing comprehension and prediction in motion-intensive scenarios. We hope LLaMo offers a foundation for future multimodal AI systems with broad applications, from sports analytics to behavioral prediction. Our code and models are available on the project website: https://github.com/ILGLJ/LLaMo.