AnimateAnyMesh++: A Flexible 4D Foundation Model for High-Fidelity Text-Driven Mesh Animation

📅 2026-04-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

High-fidelity, text-driven animation of 3D meshes is hindered by the complexity of spatio-temporal modeling and the scarcity of 4D training data. This work proposes AnimateAnyMesh++, an efficient feed-forward framework built on three components: DyMesh-XL, a large-scale and diverse dataset expanded from 60K to 300K unique identities; DyMeshVAE-Flex, a variational autoencoder enhanced with power-law topology-aware attention and vertex-normal features; and a rectified-flow generator that supports variable-length sequence generation. The method produces semantically accurate and temporally coherent animations within seconds and is reported to outperform prior approaches in both quality and efficiency across benchmarks and in-the-wild meshes.

📝 Abstract

Recent advances in 4D content generation have attracted increasing attention, yet creating high-quality animated 3D models remains challenging due to the complexity of modeling spatio-temporal distributions and the scarcity of 4D training data. We present AnimateAnyMesh++, a feed-forward framework for text-driven animation of arbitrary 3D meshes with substantial upgrades in data, architecture, and generative capability. First, we expand the DyMesh-XL dataset by mining dynamic content from Objaverse-XL, increasing the number of unique identities from 60K to 300K and substantially broadening category and motion diversity. Second, we redesign DyMeshVAE-Flex with power-law topology-aware attention and vertex-normal enhanced features, which significantly improves trajectory reconstruction and local geometry preservation while mitigating trajectory-sticking artifacts. Third, we introduce architectural changes to both DyMeshVAE-Flex and the rectified-flow (RF) generator to support variable-length sequence training and generation, enabling longer animations while preserving reconstruction fidelity. Extensive experiments demonstrate that AnimateAnyMesh++ generates semantically accurate and temporally coherent mesh animations within seconds, surpassing prior approaches in quality and efficiency. The enlarged DyMesh-XL, the upgraded DyMeshVAE-Flex, and variable-length RF together deliver consistent gains across benchmarks and in-the-wild meshes. We will release code, models, and the expanded DyMesh-XL upon acceptance of this manuscript to facilitate research in 4D content creation.
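
For readers unfamiliar with rectified flow, the following is a minimal sketch of the generic technique, not the authors' implementation. It shows the straight-line interpolation used to build training targets and a fixed-step Euler sampler that works for any sequence length T. All names (`rf_training_pair`, `euler_sample`, `toy_velocity`) are hypothetical, and the closed-form toy velocity field stands in for the learned, text-conditioned network that the paper would use over DyMeshVAE-Flex latents.

```python
import numpy as np


def rf_training_pair(x0, x1, t):
    """Return a point on the straight path from x0 (noise) to x1 (data)
    at time t, together with the constant velocity target x1 - x0."""
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target


def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps.
    x0 may have any leading sequence length T (shape (T, D))."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x


def toy_velocity(x, t, target):
    """Assumed stand-in for a learned model: the velocity that moves x
    in a straight line so that it reaches `target` at t=1."""
    return (target - x) / (1.0 - t)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for T in (8, 24):  # two different sequence lengths, same sampler
        target = rng.normal(size=(T, 4))  # hypothetical latent sequence
        noise = rng.normal(size=(T, 4))
        out = euler_sample(lambda x, t: toy_velocity(x, t, target), noise)
        print(T, out.shape, np.allclose(out, target))
```

Because the sampler only relies on elementwise operations over an array of shape (T, D), the same code handles short and long sequences; in a real model, the variable-length support would come from the network architecture rather than from the integrator.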

Problem

Research questions and friction points this paper is trying to address.

4D content generation

text-driven animation

3D mesh animation

spatio-temporal modeling

4D training data scarcity

Innovation

Methods, ideas, or system contributions that make the work stand out.

4D animation

text-driven mesh generation

topology-aware attention