Sketch2Motion: Text-driven 2D Sketch to 3D Animation via Diffusion-guided Skeleton Optimization

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of transforming 2D hand-drawn sketches into semantically plausible and temporally coherent 3D animations, a task complicated by occlusion handling and motion mapping. The authors combine classical skeletal animation with deep generative priors: using a text-to-video diffusion model, they apply motion-aware score-distillation sampling (MoSDS) to optimize skeletal motion without paired training data. The framework also incorporates linear blend skinning, a spring-mass physical simulation, and topological and contact constraints to produce animations with secondary motion effects. The authors report that, across diverse character categories, the method outperforms baselines that lack either generative priors or physical constraints.
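To make the idea of spring-driven secondary motion concrete, here is a minimal sketch, not the paper's simulator. It assumes a single point mass attached by a damped spring to an animated anchor (for example, a joint on the skeleton) and integrates it with semi-implicit Euler. The function names `spring_step` and `simulate_follow`, and the stiffness, damping, mass, and time-step values, are illustrative assumptions.

```python
import numpy as np

def spring_step(pos, vel, anchor, stiffness, damping, mass, dt):
    """Advance one damped-spring particle by one semi-implicit Euler step."""
    # Hooke's law pulls the particle toward the anchor; damping opposes velocity.
    force = -stiffness * (pos - anchor) - damping * vel
    vel = vel + dt * force / mass   # update velocity first
    pos = pos + dt * vel            # then position, using the new velocity
    return pos, vel

def simulate_follow(anchors, stiffness=50.0, damping=5.0, mass=1.0, dt=1.0 / 30.0):
    """Let a particle trail a sequence of anchor positions of shape (T, 3).

    All parameter defaults are hypothetical values chosen for a stable demo.
    """
    anchors = np.asarray(anchors, dtype=float)
    pos = anchors[0].copy()          # start at rest on the first anchor
    vel = np.zeros_like(pos)
    trajectory = []
    for anchor in anchors:
        pos, vel = spring_step(pos, vel, anchor, stiffness, damping, mass, dt)
        trajectory.append(pos.copy())
    return np.stack(trajectory)

# Usage: the anchor jumps from the origin to x = 1; the particle lags, then settles.
anchors = np.zeros((300, 3))
anchors[1:, 0] = 1.0
trajectory = simulate_follow(anchors)
```

The particle lags behind the anchor and then settles onto it, which is the kind of follow-through usually meant by secondary motion. Each step uses only differentiable arithmetic, so a version written in an autodiff framework could sit inside a gradient-based optimization loop.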

📝 Abstract

Animation of 2D hand-drawn sketches provides an effective medium for visual communication. However, these sketches pose challenges, particularly in handling occlusions and accurately mapping motion. While 3D animation naturally addresses these challenges, estimating 3D motion remains a very complex task. Recent approaches to converting 2D sketches to 3D animations have mainly focused on specific types of motion, such as bipedal movements and facial expressions. We propose Sketch2Motion, a diffusion-guided framework for skeleton-based motion synthesis that combines classical character animation pipelines with deep generative priors. Our method represents motion using skeletal transformations, which are propagated to mesh deformations via linear blend skinning. To guide the resulting animation toward realistic and semantically meaningful motion, we integrate a text-to-video diffusion model via motion-aware score-distillation sampling (MoSDS), enabling optimization without paired motion data. Additionally, we apply physics-inspired smoothness, topological, and contact constraints to stabilize optimization and preserve motion plausibility. Further, we integrate a spring-mass simulator to introduce secondary motion effects. The proposed framework is generalized, fully differentiable, modular, and compatible with biped, quadruped, and non-living articulated characters. Experiments demonstrate that our approach produces temporally coherent, text-aligned animations that outperform baseline motion transfer methods that lack generative priors or explicit physical constraints. We will make our code and dataset publicly available.
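The abstract states that skeletal transformations are propagated to mesh deformations via linear blend skinning. The following is a minimal NumPy sketch of the standard linear blend skinning formula, in which each deformed vertex is the weight-blended result of applying every bone's transform to the rest-pose vertex. It is not the authors' implementation. The function name `linear_blend_skinning`, the array shapes, and the assumption that the bone transforms already map rest-pose coordinates to posed coordinates are all illustrative choices.

```python
import numpy as np

def linear_blend_skinning(rest_vertices, weights, bone_transforms):
    """Deform a mesh with linear blend skinning.

    rest_vertices:   (V, 3) rest-pose vertex positions
    weights:         (V, B) skinning weights, each row assumed to sum to 1
    bone_transforms: (B, 4, 4) homogeneous transforms, assumed to map
                     rest-pose coordinates to posed coordinates
    returns:         (V, 3) deformed vertex positions
    """
    num_vertices = rest_vertices.shape[0]
    # Append a homogeneous coordinate of 1 to every vertex: (V, 4).
    homogeneous = np.concatenate(
        [rest_vertices, np.ones((num_vertices, 1))], axis=1
    )
    # Apply every bone transform to every vertex: (B, V, 4).
    per_bone = np.einsum("bij,vj->bvi", bone_transforms, homogeneous)
    # Blend the per-bone results with the skinning weights: (V, 4).
    blended = np.einsum("vb,bvi->vi", weights, per_bone)
    return blended[:, :3]

# Usage: two bones, one fixed (identity) and one translated by +2 along x.
identity = np.eye(4)
shifted = np.eye(4)
shifted[0, 3] = 2.0
bones = np.stack([identity, shifted])

vertices = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
weights = np.array([[0.5, 0.5],   # influenced equally by both bones
                    [1.0, 0.0]])  # bound only to the fixed bone
deformed = linear_blend_skinning(vertices, weights, bones)
```

In this example the first vertex moves halfway toward the translated bone and the second stays in place. Because the deformation is linear in the bone transforms, gradients from an image-space loss can flow back to the skeletal parameters when the same computation is written in an autodiff framework.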

Problem

Research questions and friction points this paper is trying to address.

2D sketch to 3D animation

motion synthesis

occlusion handling

text-driven animation

skeletal motion estimation

Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion-guided optimization

skeleton-based animation

score-distillation sampling