TriC-Motion: Tri-Domain Causal Modeling Grounded Text-to-Motion Generation

📅 2026-02-09
🤖 AI Summary
Existing text-to-motion generation methods struggle to jointly model the spatial, temporal, and frequency domains and are highly susceptible to noise, limiting both motion quality and text alignment. To address this, this work proposes TriC-Motion, a diffusion-based framework that introduces tri-domain causal modeling, unifying spatial, temporal, and frequency representations for the first time, and incorporates a causal counterfactual disentanglement mechanism that separates noise from semantic features, enhancing the purity and contribution of each domain. By integrating temporal motion encoding, spatial topology modeling, hybrid frequency analysis, and tri-domain score guidance, TriC-Motion achieves an R@1 of 0.612 on HumanML3D, outperforming existing approaches while generating motions that are high-fidelity, temporally coherent, diverse, and well aligned with the input text.

📝 Abstract
Text-to-motion generation, a rapidly evolving field in computer vision, aims to produce realistic, text-aligned motion sequences. Current methods focus primarily on spatial-temporal modeling or independent frequency-domain analysis and lack a unified framework for joint optimization across the spatial, temporal, and frequency domains. This limitation prevents the model from leveraging information in all domains simultaneously, leading to suboptimal generation quality. Moreover, in motion generation frameworks, motion-irrelevant cues caused by noise are often entangled with features that contribute positively to generation, which leads to motion distortion. To address these issues, we propose Tri-Domain Causal Text-to-Motion Generation (TriC-Motion), a novel diffusion-based framework that integrates spatial-temporal-frequency-domain modeling with causal intervention. TriC-Motion comprises three domain-specific modules: Temporal Motion Encoding, Spatial Topology Modeling, and Hybrid Frequency Analysis. After this modeling stage, a Score-guided Tri-domain Fusion module integrates valuable information from the three domains, simultaneously preserving temporal consistency, spatial topology, motion trends, and dynamics. In addition, a Causality-based Counterfactual Motion Disentangler exposes motion-irrelevant cues so that noise can be eliminated, disentangling each domain's real modeling contribution for superior generation. Extensive experimental results validate that TriC-Motion outperforms state-of-the-art methods, attaining an outstanding R@1 of 0.612 on the HumanML3D dataset, and demonstrate its capability to generate high-fidelity, coherent, diverse, and text-aligned motion sequences. Code is available at: https://caoyiyang1105.github.io/TriC-Motion/.
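The abstract does not describe the modules' internals, but the tri-domain idea can be illustrated with a minimal, assumption-heavy sketch: build a temporal view (frame differences), a spatial view (joint-centered poses), and a frequency view (low-pass FFT along time) of a motion sequence, then combine them with score-based weights. All function names, the energy-based scoring, and the simple per-domain encoders below are hypothetical stand-ins, not the paper's actual implementation.

```python
import numpy as np

def tri_domain_fusion(motion):
    """Illustrative sketch of score-guided tri-domain fusion (hypothetical).

    motion: (T, J, C) array of T frames, J joints, C channels.
    Returns a fused (T, J, C) representation.
    """
    T, J, C = motion.shape

    # Temporal view: frame-to-frame differences capture motion dynamics.
    temporal = np.concatenate([np.zeros((1, J, C)), np.diff(motion, axis=0)], axis=0)

    # Spatial view: center each frame on its mean joint position (a crude
    # stand-in for spatial topology modeling).
    spatial = motion - motion.mean(axis=1, keepdims=True)

    # Frequency view: low-pass filter along time via FFT to keep motion trends.
    spec = np.fft.rfft(motion, axis=0)
    cutoff = max(1, spec.shape[0] // 4)   # keep the lowest quarter of frequencies
    spec[cutoff:] = 0
    frequency = np.fft.irfft(spec, n=T, axis=0)

    # Score-guided fusion: weight each domain by the energy of its features,
    # normalized with a softmax over the three domains.
    views = np.stack([temporal, spatial, frequency])      # (3, T, J, C)
    scores = np.array([np.linalg.norm(v) for v in views])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return np.tensordot(weights, views, axes=1)           # weighted sum of views

# Tiny usage example on a random-walk "motion" sequence.
rng = np.random.default_rng(0)
fused = tri_domain_fusion(np.cumsum(rng.normal(size=(16, 4, 3)), axis=0))
print(fused.shape)  # (16, 4, 3)
```

In the paper, the per-domain representations come from learned encoders inside a diffusion model and the fusion is score-guided; this sketch only shows the shape of the pipeline, not its learned components.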
Problem

Research questions and friction points this paper is trying to address.

text-to-motion generation
spatial-temporal-frequency modeling
motion distortion
domain fusion
causal disentanglement
Innovation

Methods, ideas, or system contributions that make the work stand out.

tri-domain modeling
causal intervention
text-to-motion generation
diffusion model
motion disentanglement