A Unified Conditional Flow for Motion Generation, Editing, and Intra-Structural Retargeting

📅 2026-04-14

🤖 AI Summary

This work addresses the disconnect between text-driven motion editing and intra-structural motion retargeting, where source and target share a topology but may differ in bone lengths. The two are conventionally treated as separate tasks, which leads to incompatible input representations and fragmented deployment. We propose a unified modeling paradigm that frames both tasks as instances of conditional generation within a single framework, so that one model covers multiple tasks by modulating either the semantic (text) or the structural (skeleton) condition. Built upon rectified flow matching, our approach uses a DiT-style Transformer jointly conditioned on text and skeletal structure, with per-joint tokenization, explicit joint self-attention, and a multi-condition classifier-free guidance strategy. Evaluated on SnapMoGen and a multi-character Mixamo subset, a single trained model supports text-to-motion generation, zero-shot editing, and zero-shot intra-structural retargeting, improving structural consistency over task-specific baselines and simplifying system deployment.
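As a rough illustration of the rectified flow matching objective the model builds on, the sketch below constructs one training pair using the standard straight-line formulation (noise at t = 0, data at t = 1, constant target velocity). The function names, the tensor layout (frames, joints, features), and the time convention are assumptions made for illustration; the summary does not specify the paper's exact parameterization.

```python
import numpy as np

def rectified_flow_pair(x1, rng):
    """Build one rectified-flow training example (illustrative sketch).

    x1: a clean motion clip, assumed shape (frames, joints, features).
    Returns the interpolated sample x_t, the time t, and the target velocity.
    """
    x0 = rng.standard_normal(x1.shape)  # Gaussian noise endpoint
    t = rng.uniform()                   # scalar time in [0, 1)
    xt = (1.0 - t) * x0 + t * x1        # point on the straight line from x0 to x1
    v_target = x1 - x0                  # constant velocity along that line
    return xt, t, v_target

def flow_matching_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((v_pred - v_target) ** 2))
```

In training, a network conditioned on text and skeleton would predict `v_pred` from `xt` and `t`; that network is omitted here because its details are not given in the summary.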

📝 Abstract

Text-driven motion editing and intra-structural retargeting, where source and target share topology but may differ in bone lengths, are traditionally handled by fragmented pipelines with incompatible inputs and representations: editing relies on specialized generative steering, while retargeting is deferred to geometric post-processing. We present a unifying perspective where both tasks are cast as instances of conditional transport within a single generative framework. By leveraging recent advances in flow matching, we demonstrate that editing and retargeting are fundamentally the same generative task, distinguished only by which conditioning signal, semantic or structural, is modulated during inference. We implement this vision via a rectified-flow motion model jointly conditioned on text prompts and target skeletal structures. Our architecture extends a DiT-style transformer with per-joint tokenization and explicit joint self-attention to strictly enforce kinematic dependencies, while a multi-condition classifier-free guidance strategy balances text adherence with skeletal conformity. Experiments on SnapMoGen and a multi-character Mixamo subset show that a single trained model supports text-to-motion generation, zero-shot editing, and zero-shot intra-structural retargeting. This unified approach simplifies deployment and improves structural consistency compared to task-specific baselines.
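To make the idea of multi-condition classifier-free guidance concrete, here is a minimal sketch of one common way to combine guidance from two conditions, followed by plain Euler integration of the resulting velocity field. The additive two-weight form, the names `guided_velocity` and `euler_sample`, and the weights `w_text` and `w_skel` are assumptions for illustration; the abstract does not state the exact guidance formula or sampler used.

```python
import numpy as np

def guided_velocity(v_uncond, v_text, v_skel, w_text, w_skel):
    """Combine velocity predictions from separately conditioned passes.

    Assumed form: start from the unconditional prediction and add a
    weighted correction toward each single-condition prediction.
    """
    return (v_uncond
            + w_text * (v_text - v_uncond)
            + w_skel * (v_skel - v_uncond))

def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x
```

Under this reading, editing would correspond to changing the text condition while holding the skeleton fixed, and retargeting to changing the skeleton condition while holding the text fixed; raising `w_text` favors text adherence and raising `w_skel` favors skeletal conformity.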

Problem

Research questions and friction points this paper is trying to address.

motion editing

intra-structural retargeting

conditional generation

skeletal animation

text-driven motion

Innovation

Methods, ideas, or system contributions that make the work stand out.

conditional flow

motion editing

intra-structural retargeting