A Unified Conditional Flow for Motion Generation, Editing, and Intra-Structural Retargeting

📅 2026-04-14
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the split between text-driven motion editing and intra-structural motion retargeting (source and target share a topology but may differ in bone lengths), which are conventionally treated as separate tasks with incompatible input representations and fragmented deployment. The authors frame both tasks as instances of conditional generation within a single framework, distinguished only by whether the semantic (text) or structural (skeleton) condition is modulated at inference. Built on rectified flow matching, the model is a DiT-style Transformer jointly conditioned on text and target skeletal structure, with per-joint tokenization, explicit joint self-attention, and a multi-condition classifier-free guidance strategy. On SnapMoGen and a multi-character Mixamo subset, a single trained model supports text-to-motion generation, zero-shot editing, and zero-shot intra-structural retargeting, improving structural consistency over task-specific baselines and simplifying deployment.
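
The multi-condition classifier-free guidance mentioned in the summary can be illustrated with a minimal sketch. The listing does not give the paper's guidance formula, so the additive composition below (one guidance weight per condition, each measured against the unconditional prediction) is an assumption, and `guided_velocity`, `toy_model`, `w_text`, and `w_skel` are hypothetical names used only for illustration.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical model signature (an assumption, not the paper's API):
# model(x, t, text, skeleton) -> predicted velocity; None drops a condition.
VelocityModel = Callable[
    [Sequence[float], float, Optional[str], Optional[str]], List[float]
]


def guided_velocity(
    model: VelocityModel,
    x: Sequence[float],
    t: float,
    text: Optional[str],
    skeleton: Optional[str],
    w_text: float,
    w_skel: float,
) -> List[float]:
    """Combine unconditional, text-only, and skeleton-only predictions.

    Assumed composition:
        v = v_uncond + w_text * (v_text - v_uncond) + w_skel * (v_skel - v_uncond)
    """
    v_uncond = model(x, t, None, None)
    v_text = model(x, t, text, None)
    v_skel = model(x, t, None, skeleton)
    return [
        u + w_text * (vt - u) + w_skel * (vs - u)
        for u, vt, vs in zip(v_uncond, v_text, v_skel)
    ]


def toy_model(x, t, text, skeleton):
    """Stand-in for a trained network: text moves the first coordinate,
    the skeleton moves the second."""
    return [
        1.0 if text is not None else 0.0,
        1.0 if skeleton is not None else 0.0,
    ]


if __name__ == "__main__":
    v = guided_velocity(toy_model, [0.0, 0.0], 0.5, "wave", "tall", 2.0, 3.0)
    print(v)
```

Under this assumed form, raising `w_text` pushes the sample toward the prompt while raising `w_skel` pushes it toward the target skeleton, which is the trade-off between text adherence and skeletal conformity that the summary describes.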

📝 Abstract
Text-driven motion editing and intra-structural retargeting, where source and target share topology but may differ in bone lengths, are traditionally handled by fragmented pipelines with incompatible inputs and representations: editing relies on specialized generative steering, while retargeting is deferred to geometric post-processing. We present a unifying perspective where both tasks are cast as instances of conditional transport within a single generative framework. By leveraging recent advances in flow matching, we demonstrate that editing and retargeting are fundamentally the same generative task, distinguished only by which conditioning signal, semantic or structural, is modulated during inference. We implement this vision via a rectified-flow motion model jointly conditioned on text prompts and target skeletal structures. Our architecture extends a DiT-style transformer with per-joint tokenization and explicit joint self-attention to strictly enforce kinematic dependencies, while a multi-condition classifier-free guidance strategy balances text adherence with skeletal conformity. Experiments on SnapMoGen and a multi-character Mixamo subset show that a single trained model supports text-to-motion generation, zero-shot editing, and zero-shot intra-structural retargeting. This unified approach simplifies deployment and improves structural consistency compared to task-specific baselines.
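
The idea that editing and retargeting differ only in which condition is changed can be sketched with a toy rectified flow. This is a simplified illustration, not the paper's model. The straight-line interpolation and constant-velocity target are the standard rectified-flow construction, but `BASE_MOTION`, `oracle_velocity`, and the scalar `bone_scale` are hypothetical stand-ins for a trained network and a real skeleton description.

```python
from typing import Callable, List, Sequence, Tuple

Velocity = Callable[[Sequence[float], float], List[float]]


def rectified_flow_pair(
    x0: Sequence[float], x1: Sequence[float], t: float
) -> Tuple[List[float], List[float]]:
    """Training pair for the straight path from noise x0 to data x1:
    the point x_t = (1 - t) * x0 + t * x1 and its velocity target x1 - x0."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]
    return x_t, v_target


def euler_sample(velocity: Velocity, x0: Sequence[float], steps: int) -> List[float]:
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical toy "dataset": one 2-D motion vector per text prompt.
BASE_MOTION = {"walk": [1.0, 0.0], "wave": [0.0, 1.0]}


def oracle_velocity(text: str, bone_scale: float) -> Velocity:
    """Stand-in for a trained conditional model. For a single target point,
    the ideal velocity at (x, t) is (target - x) / (1 - t)."""
    target = [bone_scale * v for v in BASE_MOTION[text]]

    def velocity(x: Sequence[float], t: float) -> List[float]:
        return [(b - a) / (1.0 - t) for a, b in zip(x, target)]

    return velocity


if __name__ == "__main__":
    noise = [0.3, -0.7]
    # "Editing": same skeleton, different text condition.
    edited = euler_sample(oracle_velocity("wave", 1.0), noise, steps=10)
    # "Retargeting": same text, different structural condition.
    retargeted = euler_sample(oracle_velocity("walk", 1.5), noise, steps=10)
    print(edited, retargeted)
```

Both calls use the same sampler and the same starting noise. Only the conditioning argument changes, which mirrors the abstract's claim at toy scale.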
Problem

Research questions and friction points this paper is trying to address.

motion editing
intra-structural retargeting
conditional generation
skeletal animation
text-driven motion
Innovation

Methods, ideas, or system contributions that make the work stand out.

conditional flow
motion editing
intra-structural retargeting
flow matching
DiT transformer