Beyond Global Alignment: Fine-Grained Motion-Language Retrieval via Pyramidal Shapley-Taylor Learning

πŸ“… 2026-01-29
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing motion-language retrieval methods rely on global alignment, overlooking fine-grained interactions among local motion segments, body joints, and textual tokens, which limits their performance. Inspired by the pyramidal mechanism of human motion perception, this work proposes a Pyramidal Shapley-Taylor learning framework that introduces, for the first time, a pyramid-based fine-grained alignment strategy. By integrating Shapley-Taylor interaction modeling with spatiotemporal decomposition and hierarchical alignment, the framework enables precise cross-modal matching between joints and tokens as well as between motion segments and textual phrases. The method significantly outperforms state-of-the-art approaches on multiple public benchmarks, effectively capturing both local semantic correspondences and hierarchical structural relationships between motion and language.

πŸ“ Abstract
As a foundational task in human-centric cross-modal intelligence, motion-language retrieval aims to bridge the semantic gap between natural language and human motion, enabling intuitive motion analysis. Existing approaches, however, predominantly focus on aligning entire motion sequences with global textual representations. This global-centric paradigm overlooks fine-grained interactions among local motion segments, individual body joints, and text tokens, inevitably leading to suboptimal retrieval performance. To address this limitation, we draw inspiration from the pyramidal process of human motion perception (from joint dynamics to segment coherence, and finally to holistic comprehension) and propose a novel Pyramidal Shapley-Taylor (PST) learning framework for fine-grained motion-language retrieval. Specifically, the framework decomposes human motion into temporal segments and spatial body joints, and learns cross-modal correspondences through progressive joint-wise and segment-wise alignment in a pyramidal fashion, effectively capturing both local semantic details and hierarchical structural relationships. Extensive experiments on multiple public benchmark datasets demonstrate that our approach significantly outperforms state-of-the-art methods, achieving precise alignment between motion segments and body joints and their corresponding text tokens. The code of this work will be released upon acceptance.
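The paper's code is not yet released, so the exact alignment objective is unavailable. As background for the "Shapley-Taylor" component of the framework's name, the following is a minimal, self-contained sketch of the order-2 (pairwise) Shapley-Taylor interaction index from cooperative game theory, computed exactly over a hypothetical toy value function. The `value` function and the player indices are illustrative stand-ins (e.g., for subsets of joints or tokens scored by a cross-modal similarity model), not the paper's method.

```python
from itertools import combinations
from math import comb

def discrete_derivative(f, S, T):
    """δ_S f(T) = Σ_{W ⊆ S} (-1)^{|S|-|W|} f(W ∪ T): the joint marginal
    contribution of coalition S on top of a disjoint context T."""
    total = 0.0
    for r in range(len(S) + 1):
        for W in combinations(S, r):
            total += (-1) ** (len(S) - r) * f(frozenset(W) | frozenset(T))
    return total

def shapley_taylor_pairwise(f, players):
    """Order-2 Shapley-Taylor index: for each pair S = {i, j},
    I_S = (2/n) Σ_{T ⊆ N∖S} δ_S f(T) / C(n-1, |T|).
    Exact enumeration; exponential in n, so only for small toy games."""
    n = len(players)
    scores = {}
    for S in combinations(players, 2):
        rest = [p for p in players if p not in S]
        acc = 0.0
        for r in range(len(rest) + 1):
            for T in combinations(rest, r):
                acc += discrete_derivative(f, S, T) / comb(n - 1, r)
        scores[S] = 2.0 / n * acc
    return scores

# Hypothetical toy value function: each player contributes 1, and players
# 0 and 1 earn a synergy bonus of 2 when both are present.
def value(coalition):
    return len(coalition) + (2.0 if {0, 1} <= set(coalition) else 0.0)

interactions = shapley_taylor_pairwise(value, [0, 1, 2])
# The index attributes the synergy of 2.0 to the pair (0, 1) and 0 elsewhere.
```

In this toy game the index isolates exactly the interacting pair, which is the property that makes Shapley-Taylor attribution attractive for crediting fine-grained joint-token and segment-phrase correspondences.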
Problem

Research questions and friction points this paper is trying to address.

motion-language retrieval
fine-grained alignment
semantic gap
cross-modal correspondence
human motion perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

fine-grained retrieval
pyramidal alignment
Shapley-Taylor interaction
motion-language grounding
cross-modal correspondence
Hanmo Chen
Hangzhou Institute of Technology, Xidian University, Hangzhou, China
Guangtao Lyu
Xidian University, Xi’an, China
Chenghao Xu
EPFL
Robotics · Dynamic SLAM · Active Vision
Jiexi Yan
Xidian University, Xi’an, China
Xu Yang
Xidian University, Xi’an, China
Cheng Deng
University of Edinburgh
On-device LLM · NLP · GeoAI