Event-T2M: Event-level Conditioning for Complex Text-to-Motion Synthesis

πŸ“… 2026-02-04
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing text-to-motion generation methods often suffer from missing actions, incorrect sequencing, or unnatural transitions when handling multi-action prompts, owing to their reliance on a single global text embedding. This work addresses these limitations by formally defining "events" as the fundamental units of motion generation and introducing event-level conditional modeling. Specifically, we employ a motion-aware retrieval model to decompose prompts into semantically coherent events and encode each one, and integrate an event-level cross-attention mechanism within a Conformer architecture to fuse the textual and motion representations. To support this paradigm, we construct HumanML3D-E, the first benchmark dataset hierarchically annotated by event count. Experiments demonstrate that our approach achieves state-of-the-art or competitive performance on HumanML3D, KIT-ML, and HumanML3D-E, with notable improvements in temporal fidelity and motion naturalness in multi-event scenarios. User studies further confirm that the generated motions closely resemble real human actions.
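
To make the decomposition step concrete, here is a minimal, purely illustrative sketch: it splits a prompt into ordered candidate events using hand-written clause rules, standing in for the learned motion-aware retrieval model described above. The function name and the splitting heuristic are assumptions for illustration, not the paper's implementation.

```python
import re
from typing import List

# Naive rule-based stand-in for the paper's learned, motion-aware event
# decomposition: split on clause boundaries and common sequencing words.
_CLAUSE_SPLIT = re.compile(r",|;|\band then\b|\bthen\b|\band\b", re.IGNORECASE)

def decompose_into_events(prompt: str) -> List[str]:
    """Split a multi-action prompt into ordered candidate events."""
    parts = [p.strip() for p in _CLAUSE_SPLIT.split(prompt)]
    return [p for p in parts if p]  # drop empty fragments

print(decompose_into_events(
    "a person walks forward, then sits down and waves with the right hand"
))
# ['a person walks forward', 'sits down', 'waves with the right hand']
```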

πŸ“ Abstract
Text-to-motion generation has advanced with diffusion models, yet existing systems often collapse complex multi-action prompts into a single embedding, leading to omissions, reordering, or unnatural transitions. In this work, we shift perspective by introducing a principled definition of an event as the smallest semantically self-contained action or state change in a text prompt that can be temporally aligned with a motion segment. Building on this definition, we propose Event-T2M, a diffusion-based framework that decomposes prompts into events, encodes each with a motion-aware retrieval model, and integrates them through event-based cross-attention in Conformer blocks. Existing benchmarks mix simple and multi-event prompts, making it unclear whether models that succeed on single actions generalize to multi-action cases. To address this, we construct HumanML3D-E, the first benchmark stratified by event count. Experiments on HumanML3D, KIT-ML, and HumanML3D-E show that Event-T2M matches state-of-the-art baselines on standard tests while outperforming them as event complexity increases. Human studies validate the plausibility of our event definition, the reliability of HumanML3D-E, and the superiority of Event-T2M in generating multi-event motions that preserve action order and achieve naturalness close to the ground truth. These results establish event-level conditioning as a generalizable principle for advancing text-to-motion generation beyond single-action prompts.
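
The fusion step can likewise be sketched. The block below is a minimal, assumption-laden rendering of event-level cross-attention inside a Conformer-style layer: noisy motion tokens first self-attend, then cross-attend to one embedding per decomposed event, followed by a convolution module and a feed-forward network. Layer sizes, module ordering, and the simplified convolution are illustrative guesses, not Event-T2M's actual architecture.

```python
import torch
import torch.nn as nn

class EventCrossAttentionBlock(nn.Module):
    """Sketch of one Conformer-style block with event-level cross-attention,
    assuming precomputed per-event text embeddings (e.g. from a motion-aware
    retrieval model). Hyperparameters here are placeholders."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, kernel_size: int = 3):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Motion frames attend to per-event text embeddings.
        self.event_cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Simplified stand-in for Conformer's convolution module.
        self.conv = nn.Conv1d(d_model, d_model, kernel_size, padding=kernel_size // 2)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(4))

    def forward(self, motion: torch.Tensor, events: torch.Tensor) -> torch.Tensor:
        # motion: (batch, frames, d_model) noisy motion tokens in the diffusion model
        # events: (batch, n_events, d_model) one embedding per decomposed event
        x = motion
        h = self.norms[0](x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norms[1](x)
        x = x + self.event_cross_attn(h, events, events, need_weights=False)[0]
        h = self.norms[2](x).transpose(1, 2)  # (batch, d_model, frames) for Conv1d
        x = x + self.conv(h).transpose(1, 2)
        return x + self.ffn(self.norms[3](x))

block = EventCrossAttentionBlock()
out = block(torch.randn(2, 60, 256), torch.randn(2, 3, 256))
print(out.shape)  # torch.Size([2, 60, 256])
```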
Problem

Research questions and friction points this paper is trying to address.

text-to-motion
multi-action prompts
event-level conditioning
motion synthesis
complex action sequences
Innovation

Methods, ideas, or system contributions that make the work stand out.

event-level conditioning
text-to-motion synthesis
diffusion models
multi-action prompts
motion generation
Seong-Eun Hong
Department of Computer Science and Engineering, Korea University
JaeYoung Seon
Department of Artificial Intelligence, Kyung Hee University
Juyeong Hwang
Department of Computer Science and Engineering, Korea University
JongHwan Shin
Department of Computer Science and Engineering, Korea University
HyeongYeop Kang
Assistant Professor, Korea University
Neural Computer Graphics · Extended Reality · Artificial Intelligence · Human-Computer Interaction