🤖 AI Summary
Existing motion-language retrieval methods rely on global alignment, overlooking fine-grained interactions among local motion segments, body joints, and textual tokens, which limits their performance. Inspired by the pyramidal mechanism of human motion perception, this work proposes a Pyramidal Shapley-Taylor learning framework that introduces, for the first time, a pyramid-based fine-grained alignment strategy. By integrating Shapley-Taylor interaction modeling with spatiotemporal decomposition and hierarchical alignment, the framework enables precise cross-modal matching between joints and tokens as well as between motion segments and textual phrases. The method significantly outperforms state-of-the-art approaches on multiple public benchmarks, effectively capturing both local semantic correspondences and hierarchical structural relationships between motion and language.
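The Shapley-Taylor interaction modeling mentioned above is a game-theoretic way of quantifying how elements (here, joints or segments) jointly contribute beyond their individual effects. As a rough illustration only (the paper's actual formulation is not given here), the toy sketch below computes the classic pairwise Shapley interaction index, of which the Shapley-Taylor index is a higher-order generalization; the value function `v` is a hypothetical stand-in:

```python
from itertools import combinations
from math import factorial

def interaction_index(v, n, i, j):
    """Pairwise Shapley interaction index I(i, j) for a value function v
    over players 0..n-1 (Grabisch-Roubens form)."""
    others = [p for p in range(n) if p not in (i, j)]
    total = 0.0
    for t in range(len(others) + 1):
        # Coalition-size weight t!(n-t-2)!/(n-1)!; these weights sum to 1.
        w = factorial(t) * factorial(n - t - 2) / factorial(n - 1)
        for T in combinations(others, t):
            T = set(T)
            # Discrete second derivative: the synergy of adding i and j
            # together versus adding each separately on top of T.
            delta = v(T | {i, j}) - v(T | {i}) - v(T | {j}) + v(T)
            total += w * delta
    return total

# Hypothetical value function: additive per element, plus a fixed
# synergy bonus when elements 0 and 1 co-occur.
def v(S):
    return len(S) + (2.0 if {0, 1} <= S else 0.0)
```

For this `v`, `interaction_index(v, 4, 0, 1)` recovers exactly the synergy bonus 2.0, while pairs without synergy score 0, which is the kind of signal such interaction modeling is meant to isolate.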
📄 Abstract
As a foundational task in human-centric cross-modal intelligence, motion-language retrieval aims to bridge the semantic gap between natural language and human motion, enabling intuitive motion analysis. Existing approaches, however, predominantly align entire motion sequences with global textual representations. This global-centric paradigm overlooks the fine-grained interactions among local motion segments, individual body joints, and text tokens, inevitably leading to suboptimal retrieval performance. To address this limitation, we draw inspiration from the pyramidal process of human motion perception (from joint dynamics, to segment coherence, and finally to holistic comprehension) and propose a novel Pyramidal Shapley-Taylor (PST) learning framework for fine-grained motion-language retrieval. Specifically, the framework decomposes human motion into temporal segments and spatial body joints, and learns cross-modal correspondences through progressive joint-wise and segment-wise alignment in a pyramidal fashion, effectively capturing both local semantic details and hierarchical structural relationships. Extensive experiments on multiple public benchmark datasets demonstrate that our approach significantly outperforms state-of-the-art methods, achieving precise alignment between motion segments and body joints and their corresponding text tokens. The code will be released upon acceptance.
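To make the pyramidal, coarse-to-fine alignment idea concrete, the sketch below scores a motion-text pair at three granularities (joint-token, segment-phrase, global) and combines them. All names, the max-then-mean aggregation, and the level weights are illustrative assumptions, not the paper's actual scoring function:

```python
import numpy as np

def cosine(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def pyramidal_similarity(joint_emb, segment_emb, motion_emb,
                         token_emb, phrase_emb, text_emb,
                         weights=(0.25, 0.25, 0.5)):
    # Level 1 (finest): each body joint matches its best text token,
    # then scores are averaged over joints.
    s_joint = cosine(joint_emb, token_emb).max(axis=1).mean()
    # Level 2: each temporal segment matches its best textual phrase.
    s_seg = cosine(segment_emb, phrase_emb).max(axis=1).mean()
    # Level 3 (coarsest): whole motion sequence vs. whole sentence.
    s_glob = float(cosine(motion_emb[None, :], text_emb[None, :])[0, 0])
    w1, w2, w3 = weights
    return w1 * s_joint + w2 * s_seg + w3 * s_glob
```

Under this sketch, a pair whose joint, segment, and global embeddings all coincide with the text side scores exactly 1.0; real models would learn these embeddings and the fusion rather than fix them by hand.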