AnimaX: Animating the Inanimate in 3D with Joint Video-Pose Diffusion Models

📅 2025-06-24

📈 Citations: 0

✨ Influential: 0

career value

185K/year

🤖 AI Summary

Traditional 3D animation synthesis methods are constrained by fixed skeletal topologies or computationally expensive optimization in high-dimensional deformation spaces. AnimaX introduces the first feed-forward video-pose joint diffusion framework, transferring motion priors from video diffusion models to 3D skeletal animation and supporting articulated meshes with arbitrary topology. Its core contributions are: (1) shared spatiotemporal positional encoding and modality-aware embeddings for alignment between input videos and multi-view 2D pose sequences; and (2) an end-to-end generation pipeline conditioned on template-rendered features and text prompts, integrating 3D joint triangulation and inverse kinematics. On the VBench benchmark, AnimaX significantly outperforms existing optimization-based and skeleton-constrained approaches, achieving state-of-the-art performance in generalization, motion fidelity, and inference efficiency—marking the first method enabling category-agnostic, topology-agnostic, high-quality 3D animation generation.

Technology Category

Application Category

📝 Abstract

We present AnimaX, a feed-forward 3D animation framework that bridges the motion priors of video diffusion models with the controllable structure of skeleton-based animation. Traditional motion synthesis methods are either restricted to fixed skeletal topologies or require costly optimization in high-dimensional deformation spaces. In contrast, AnimaX effectively transfers video-based motion knowledge to the 3D domain, supporting diverse articulated meshes with arbitrary skeletons. Our method represents 3D motion as multi-view, multi-frame 2D pose maps, and enables joint video-pose diffusion conditioned on template renderings and a textual motion prompt. We introduce shared positional encodings and modality-aware embeddings to ensure spatial-temporal alignment between video and pose sequences, effectively transferring video priors to motion generation task. The resulting multi-view pose sequences are triangulated into 3D joint positions and converted into mesh animation via inverse kinematics. Trained on a newly curated dataset of 160,000 rigged sequences, AnimaX achieves state-of-the-art results on VBench in generalization, motion fidelity, and efficiency, offering a scalable solution for category-agnostic 3D animation. Project page: href{https://anima-x.github.io/}{https://anima-x.github.io/}.

Problem

Research questions and friction points this paper is trying to address.

Bridges video diffusion motion priors with skeleton animation control

Transfers video-based motion to 3D for diverse articulated meshes

Enables joint video-pose diffusion with text-conditioned motion prompts

Innovation

Methods, ideas, or system contributions that make the work stand out.

Feed-forward 3D animation with video-pose diffusion

Multi-view pose maps for 3D motion transfer

Shared encodings for video-pose alignment

🔎 Similar Papers

DiffMesh: A Motion-aware Diffusion Framework for Human Mesh Recovery from Videos