🤖 AI Summary
This work addresses the semantic inconsistency in text-to-motion retrieval caused by standard contrastive learning, which treats each textual description as the sole positive sample and ignores the inherent semantic diversity of valid descriptions for the same motion. To resolve this, the authors propose MoCHA, a framework that introduces textual canonicalization as a preprocessing step, mapping raw descriptions to a canonical form that retains only the semantics recoverable from the motion itself. This effectively reduces intra-class variance in text embeddings. The approach supports both rule-based and learned canonicalizers—including implementations based on GPT-5.2 and a distilled FlanT5—without requiring online large language model inference, and is compatible with any retrieval architecture, such as MoPa. Experiments show consistent improvements: T2M R@1 increases by 3.1% and 10.3% on HumanML3D and KIT-ML, respectively, intra-class variance drops by 11–19%, and cross-dataset transfer performance improves by up to 94%.
📝 Abstract
Text-motion retrieval systems learn shared embedding spaces from motion-caption pairs via contrastive objectives. However, each caption is not a deterministic label but a sample from a distribution of valid descriptions: different annotators produce different text for the same motion, mixing motion-recoverable semantics (action type, body parts, directionality) with annotator-specific style and inferred context that cannot be determined from 3D joint coordinates alone. Standard contrastive training treats each caption as the single positive target, overlooking this distributional structure and inducing within-motion embedding variance that weakens alignment. We propose MoCHA, a text canonicalization framework that reduces this variance by projecting each caption onto its motion-recoverable content prior to encoding, producing tighter positive clusters and better-separated embeddings. Canonicalization is a general principle: even deterministic rule-based methods improve cross-dataset transfer, though learned canonicalizers provide substantially larger gains. We present two learned variants: an LLM-based approach (GPT-5.2) and a distilled FlanT5 model requiring no LLM at inference time. MoCHA operates as a preprocessing step compatible with any retrieval architecture. Applied to MoPa (MotionPatches), MoCHA sets a new state of the art on both HumanML3D (H) and KIT-ML (K): the LLM variant achieves 13.9% T2M R@1 on H (+3.1pp) and 24.3% on K (+10.3pp), while the LLM-free T5 variant achieves gains of +2.5pp and +8.1pp. Canonicalization reduces within-motion text-embedding variance by 11-19% and improves cross-dataset transfer substantially, with H to K improving by 94% and K to H by 52%, demonstrating that standardizing the language space yields more transferable motion-language representations.
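To make the canonicalization idea concrete, here is a minimal sketch of a rule-based canonicalizer in the spirit the abstract describes: it collapses verb synonyms and drops annotator-style tokens so that different captions for the same motion map to a tighter canonical form. The rule tables (`VERB_SYNONYMS`, `STYLE_WORDS`) and the function name are illustrative assumptions, not the paper's actual rules.

```python
import re

# Hypothetical rule tables; MoCHA's real rule-based canonicalizer is not
# reproduced here -- these entries are illustrative stand-ins.
VERB_SYNONYMS = {
    "strolls": "walks", "strides": "walks", "ambles": "walks",
    "sprints": "runs", "jogs": "runs",
}
# Tokens reflecting annotator style or inferred context rather than
# motion-recoverable semantics (action type, body parts, directionality).
STYLE_WORDS = {"casually", "happily", "nervously", "seems", "appears"}


def canonicalize(caption: str) -> str:
    """Project a caption onto an approximation of its motion-recoverable content."""
    tokens = re.findall(r"[a-z']+", caption.lower())
    out = []
    for tok in tokens:
        tok = VERB_SYNONYMS.get(tok, tok)  # collapse verb synonyms
        if tok in STYLE_WORDS:             # drop style/context tokens
            continue
        out.append(tok)
    return " ".join(out)
```

Under this sketch, `"A person strolls casually forward."` and `"A person walks forward."` both canonicalize to `"a person walks forward"`, shrinking within-motion text variance before the contrastive encoder ever sees the captions; the learned GPT-5.2 and FlanT5 variants play the same role with far broader coverage.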