ActionMesh: Animated 3D Mesh Generation with Temporal 3D Diffusion

📅 2026-01-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing approaches to animating 3D meshes are often hindered by complex setups, low efficiency, or insufficient output quality. This work proposes ActionMesh, the first method to incorporate temporal dynamics into 3D diffusion models by introducing a temporally coherent 3D diffusion framework. Leveraging a temporal 3D autoencoder, ActionMesh enables end-to-end generation of topologically consistent, high-fidelity animated 3D meshes from multimodal inputs—including monocular videos, text prompts, or static meshes paired with text—without requiring skeletal rigging. Evaluated on the Consistent4D and Objaverse benchmarks, the method achieves state-of-the-art performance in both geometric accuracy and temporal consistency, significantly improving generation speed and visual quality. Furthermore, it naturally supports downstream applications such as texture mapping and motion retargeting.

📝 Abstract
Generating animated 3D objects is at the heart of many applications, yet most advanced works are typically difficult to apply in practice because of their limited setup, their long runtime, or their limited quality. We introduce ActionMesh, a generative model that predicts production-ready 3D meshes "in action" in a feed-forward manner. Drawing inspiration from early video models, our key insight is to modify existing 3D diffusion models to include a temporal axis, resulting in a framework we dubbed "temporal 3D diffusion". Specifically, we first adapt the 3D diffusion stage to generate a sequence of synchronized latents representing time-varying and independent 3D shapes. Second, we design a temporal 3D autoencoder that translates a sequence of independent shapes into the corresponding deformations of a pre-defined reference shape, allowing us to build an animation. Combining these two components, ActionMesh generates animated 3D meshes from different inputs such as a monocular video, a text description, or even a 3D mesh with a text prompt describing its animation. Moreover, compared to previous approaches, our method is fast and produces results that are rig-free and topology-consistent, enabling rapid iteration and seamless applications like texturing and retargeting. We evaluate our model on standard video-to-4D benchmarks (Consistent4D, Objaverse) and report state-of-the-art performance on both geometric accuracy and temporal consistency, demonstrating that our model can deliver animated 3D meshes with unprecedented speed and quality.
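The two-stage pipeline described in the abstract can be sketched as follows. This is a minimal illustrative outline, not the authors' implementation: every function name, tensor shape, and the use of random stubs in place of the real diffusion model and autoencoder are assumptions. The point it shows is the data flow, a sequence of synchronized per-frame latents feeding a temporal autoencoder that outputs deformations of one fixed-topology reference mesh.

```python
import numpy as np

rng = np.random.default_rng(0)

def temporal_3d_diffusion(num_frames: int, latent_dim: int) -> np.ndarray:
    """Stage 1 stub (hypothetical): a sequence of synchronized latents,
    one per frame, each representing an independent 3D shape."""
    return rng.standard_normal((num_frames, latent_dim))

def decode_reference_shape(latents: np.ndarray, num_verts: int = 32) -> np.ndarray:
    """Stub (hypothetical): decode a reference mesh's vertex positions.
    A real decoder would condition on the latents; here we just sample."""
    return rng.standard_normal((num_verts, 3))

def temporal_autoencoder(latents: np.ndarray, ref_verts: np.ndarray) -> np.ndarray:
    """Stage 2 stub (hypothetical): map the latent sequence to per-frame
    vertex offsets of the fixed reference topology, so every frame shares
    the same connectivity (topology-consistent, rig-free animation)."""
    num_frames = latents.shape[0]
    offsets = 0.1 * rng.standard_normal((num_frames, *ref_verts.shape))
    return ref_verts[None] + offsets  # (T, V, 3) animated vertex positions

latents = temporal_3d_diffusion(num_frames=8, latent_dim=64)
ref = decode_reference_shape(latents)
animation = temporal_autoencoder(latents, ref)
print(animation.shape)  # one vertex array per frame, shared topology
```

Because all frames deform a single reference mesh rather than being meshed independently, per-vertex attributes such as UV coordinates carry over across frames, which is what makes downstream texturing and retargeting straightforward.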
Problem

Research questions and friction points this paper is trying to address.

animated 3D mesh generation
temporal consistency
3D diffusion
production-ready 3D assets
4D generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

temporal 3D diffusion
animated 3D mesh generation
rig-free animation
topology-consistent deformation
feed-forward 4D generation