T2Bs: Text-to-Character Blendshapes via Video Generation

📅 2025-09-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two key limitations in text-driven 4D facial modeling: (1) static text-to-3D methods lack motion synthesis, and (2) video diffusion models suffer from temporal inconsistency and multi-view geometric misalignment. To this end, the authors propose a cross-modal generative framework that jointly enforces static geometric constraints and dynamic motion modeling. Given only text input, the method integrates text-to-3D generation with video diffusion priors to construct a deformable 3D Gaussian splatting representation. A view-dependent MLP deformation network enables co-optimization of geometry and motion, while cross-modal alignment and temporal consistency regularization yield high-fidelity, low-artifact, multi-view-consistent 4D deformations. Experiments show that the approach significantly outperforms existing 4D generation methods in geometric fidelity, motion naturalness, and view consistency, and the resulting head models are fully registered, high-fidelity, and animatable in real time.
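To make the deformation component concrete, below is a minimal sketch of what a view-dependent deformation MLP over 3D Gaussian centers could look like. This is an illustration under assumptions, not the paper's implementation: the class name, layer sizes, and input encoding (raw canonical position, scalar time, unit view direction) are all hypothetical.

```python
import torch
import torch.nn as nn

class ViewDependentDeformMLP(nn.Module):
    """Hypothetical sketch of a view-dependent deformation network.

    Predicts a per-Gaussian position offset from the canonical Gaussian
    center, a time value, and the viewing direction, so motion from video
    frames can be fit while the static geometry anchors the result.
    """

    def __init__(self, hidden: int = 128):
        super().__init__()
        # Input: 3 (canonical center) + 1 (time) + 3 (unit view direction)
        self.net = nn.Sequential(
            nn.Linear(3 + 1 + 3, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 3),  # xyz offset per Gaussian
        )

    def forward(self, centers: torch.Tensor, t: torch.Tensor,
                view_dir: torch.Tensor) -> torch.Tensor:
        # centers: (N, 3); t: scalar tensor in [0, 1]; view_dir: (3,) unit vector
        n = centers.shape[0]
        t_col = t.expand(n, 1)          # broadcast time to every Gaussian
        v = view_dir.expand(n, 3)       # broadcast view direction likewise
        offsets = self.net(torch.cat([centers, t_col, v], dim=-1))
        return centers + offsets        # deformed Gaussian centers at time t

# Example: deform 10k Gaussians for one frame seen from one camera.
mlp = ViewDependentDeformMLP()
centers = torch.randn(10_000, 3)
deformed = mlp(centers, torch.tensor(0.5), torch.tensor([0.0, 0.0, 1.0]))
```

Conditioning on the view direction lets the network absorb per-view inconsistencies in the generated videos instead of baking them into the shared geometry.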

📝 Abstract
We present T2Bs, a framework for generating high-quality, animatable character head morphable models from text by combining static text-to-3D generation with video diffusion. Text-to-3D models produce detailed static geometry but lack motion synthesis, while video diffusion models generate motion with temporal and multi-view geometric inconsistencies. T2Bs bridges this gap by leveraging deformable 3D Gaussian splatting to align static 3D assets with video outputs. By constraining motion with static geometry and employing a view-dependent deformation MLP, T2Bs (i) outperforms existing 4D generation methods in accuracy and expressiveness while reducing video artifacts and view inconsistencies, and (ii) reconstructs smooth, coherent, fully registered 3D geometries designed to scale for building morphable models with diverse, realistic facial motions. This enables synthesizing expressive, animatable character heads that surpass current 4D generation techniques.
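Since the end product is a morphable (blendshape) head model, it helps to recall how such a model is animated: a neutral shape plus a weighted sum of per-expression vertex offsets. The sketch below shows this standard formulation; the tensor shapes and data are placeholders, not T2Bs outputs.

```python
import torch

def blend(neutral: torch.Tensor, deltas: torch.Tensor,
          weights: torch.Tensor) -> torch.Tensor:
    """Standard blendshape animation: neutral + sum_k w_k * delta_k.

    neutral: (V, 3) vertices; deltas: (K, V, 3) expression offsets;
    weights: (K,) blend weights for one animation frame.
    """
    return neutral + torch.einsum("k,kvc->vc", weights, deltas)

neutral = torch.zeros(5_000, 3)                 # placeholder head mesh
deltas = torch.randn(10, 5_000, 3)              # 10 expression blendshapes
frame = blend(neutral, deltas, torch.rand(10))  # one animated frame
```

"Fully registered" matters here: every reconstructed expression must share the same vertex (or Gaussian) correspondence so the offsets can be blended this way.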
Problem

Research questions and friction points this paper is trying to address.

Generating animatable 3D character heads from text
Bridging static 3D geometry with motion synthesis
Reducing artifacts and inconsistencies in 4D generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combining static text-to-3D generation with video diffusion priors
Aligning the static 3D asset to video outputs via deformable 3D Gaussian splatting
Employing a view-dependent deformation MLP (a sketch of the combined training objective follows this list)
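The interplay of these three ideas can be summarized as a joint objective: render the deformed Gaussians, match the renders to video frames, and regularize the deformation toward the static geometry and toward temporal smoothness. The sketch below is a hedged guess at such an objective; the loss names, terms, and weights are assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def t2bs_style_losses(rendered, video_frame,
                      deformed_t, deformed_prev, canonical):
    """Hypothetical combination of the losses described above.

    - photometric: match renders of deformed Gaussians to video frames
    - rigidity: keep deformed centers close to the static geometry
    - temporal: penalize frame-to-frame jitter in Gaussian centers
    """
    photometric = F.l1_loss(rendered, video_frame)
    rigidity = (deformed_t - canonical).pow(2).mean()
    temporal = (deformed_t - deformed_prev).pow(2).mean()
    # Weights are assumed values, not taken from the paper.
    return photometric + 0.1 * rigidity + 0.05 * temporal

# Example with dummy tensors (shapes are illustrative).
img = torch.rand(3, 256, 256)
pts = torch.randn(1_000, 3)
loss = t2bs_style_losses(img, torch.rand(3, 256, 256),
                         pts + 0.01, pts, pts)
```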
👥 Authors
Jiahao Luo (University of California, Santa Cruz)
Chaoyang Wang (Snap Inc.)
Michael Vasilkovsky (Snap Inc.)
Vladislav Shakhrai (Snap Inc.)
Di Liu (Rutgers University)
Peiye Zhuang (Snap Inc.)
Sergey Tulyakov (Director of Research, Snap Inc.)
Peter Wonka (King Abdullah University of Science and Technology, KAUST)
Hsin-Ying Lee (stealth mode startup)
James Davis (University of California, Santa Cruz)
Jian Wang (Snap Inc.)