Pay Attention and Move Better: Harnessing Attention for Interactive Motion Generation and Training-free Editing

📅 2024-10-24
📈 Citations: 2
Influential: 0
🤖 AI Summary
Existing text-driven human motion editing methods lack word-level alignment and interpretability, hindering fine-grained control. To address this, we propose MotionCLR—a diffusion-based model that jointly employs self-attention (to capture temporal motion structure) and cross-modal cross-attention (to achieve precise alignment between textual tokens and motion frames). MotionCLR is the first method to enable zero-shot, word-level motion editing without additional training—supporting operations such as emphasis, token replacement, and exemplar-based generation—while further extending to action counting and grounding-aware generation. Its attention maps are both highly interpretable and practically manipulable. Extensive evaluations on multiple benchmarks demonstrate significant improvements in editing accuracy and generation quality, effectively overcoming key bottlenecks in fine-grained text-motion alignment and controllable editing within current diffusion-based frameworks.

📝 Abstract
This research delves into the problem of interactive editing of human motion generation. Previous motion diffusion models lack explicit modeling of the word-level text-motion correspondence and good explainability, which restricts their fine-grained editing ability. To address this issue, we propose an attention-based motion diffusion model, namely MotionCLR, with CLeaR modeling of attention mechanisms. Technically, MotionCLR models in-modality and cross-modality interactions with self-attention and cross-attention, respectively. More specifically, the self-attention mechanism measures the sequential similarity between frames and shapes the ordering of motion features. By contrast, the cross-attention mechanism finds the fine-grained word-sequence correspondence and activates the corresponding timesteps in the motion sequence. Based on these key properties, we develop a versatile set of simple yet effective motion editing methods via manipulation of attention maps, such as motion (de-)emphasizing, in-place motion replacement, and example-based motion generation. To further verify the explainability of the attention mechanism, we additionally explore the potential of action counting and grounded motion generation via attention maps. Our experimental results show that our method achieves strong generation and editing ability with good explainability.
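The editing operations the abstract describes hinge on one idea: the cross-attention map between motion-frame queries and word-token keys can be rescaled per token and renormalized, so a chosen word contributes more (emphasis) or less (de-emphasis) to the generated motion. A minimal NumPy sketch of that mechanism follows; the function name, shapes, and `token_scales` parameter are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def cross_attention(motion_q, text_k, text_v, token_scales=None):
    """Toy cross-attention between motion-frame queries and word-token keys.

    token_scales is a hypothetical per-token multiplier applied to the
    attention map before renormalization -- the core idea behind
    attention-based motion (de-)emphasis editing.
    """
    d = motion_q.shape[-1]
    logits = motion_q @ text_k.T / np.sqrt(d)            # (frames, tokens)
    attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)             # softmax over tokens
    if token_scales is not None:
        attn = attn * token_scales                       # emphasize / de-emphasize words
        attn /= attn.sum(axis=-1, keepdims=True)         # renormalize rows
    return attn @ text_v                                 # (frames, dim)

rng = np.random.default_rng(0)
frames, tokens, dim = 8, 5, 16
q = rng.standard_normal((frames, dim))   # motion-frame features (queries)
k = rng.standard_normal((tokens, dim))   # word-token features (keys)
v = rng.standard_normal((tokens, dim))   # word-token features (values)

# Up-weight one token (e.g. an action verb) to emphasize its motion contribution.
scales = np.ones(tokens)
scales[2] = 2.0
out = cross_attention(q, k, v, token_scales=scales)
print(out.shape)  # (8, 16)
```

Setting a token's scale below 1 de-emphasizes it instead; because each row is renormalized, the result stays a valid attention distribution over words, which is what keeps such edits training-free.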
Problem

Research questions and friction points this paper is trying to address.

Text-to-Action Correlation
Lack of Precision
Understanding Difficulty
Innovation

Methods, ideas, or system contributions that make the work stand out.

MotionCLR
Attention Mechanism
Action Editing