T2Bs: Text-to-Character Blendshapes via Video Generation

📅 2025-09-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two key limitations in text-driven 4D facial modeling: (1) static text-to-3D methods lack motion synthesis, and (2) video diffusion models suffer from temporal inconsistency and multi-view geometric misalignment. To this end, the authors propose a cross-modal generative framework that jointly enforces static geometric constraints and dynamic motion modeling. Given only text input, the method integrates text-to-3D generation with video diffusion priors to construct a deformable 3D Gaussian splatting representation. A view-dependent MLP deformation network enables co-optimization of geometry and motion, while cross-modal alignment and temporal consistency regularization yield high-fidelity, low-artifact, multi-view-consistent 4D deformations. Experiments show that the approach significantly outperforms existing 4D generation methods in geometric fidelity, motion naturalness, and view consistency, and the resulting head models are fully registered, high-fidelity, and animatable in real time.
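To make the deformation component concrete, below is a minimal sketch of what a view-dependent deformation MLP over 3D Gaussian centers could look like. This is an illustration under assumptions, not the paper's implementation: the class name, layer sizes, and input encoding (raw canonical position, scalar time, unit view direction) are all hypothetical.

```python
import torch
import torch.nn as nn

class ViewDependentDeformMLP(nn.Module):
    """Hypothetical sketch of a view-dependent deformation network.

    Predicts a per-Gaussian position offset from the canonical Gaussian
    center, a time value, and the viewing direction, so motion from video
    frames can be fit while the static geometry anchors the result.
    """

    def __init__(self, hidden: int = 128):
        super().__init__()
        # Input: 3 (canonical center) + 1 (time) + 3 (unit view direction)
        self.net = nn.Sequential(
            nn.Linear(3 + 1 + 3, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 3),  # xyz offset per Gaussian
        )

    def forward(self, centers: torch.Tensor, t: torch.Tensor,
                view_dir: torch.Tensor) -> torch.Tensor:
        # centers: (N, 3); t: scalar tensor in [0, 1]; view_dir: (3,) unit vector
        n = centers.shape[0]
        t_col = t.expand(n, 1)          # broadcast time to every Gaussian
        v = view_dir.expand(n, 3)       # broadcast view direction likewise
        offsets = self.net(torch.cat([centers, t_col, v], dim=-1))
        return centers + offsets        # deformed Gaussian centers at time t

# Example: deform 10k Gaussians for one frame seen from one camera.
mlp = ViewDependentDeformMLP()
centers = torch.randn(10_000, 3)
deformed = mlp(centers, torch.tensor(0.5), torch.tensor([0.0, 0.0, 1.0]))
```

Conditioning on the view direction lets the network absorb per-view inconsistencies in the generated videos instead of baking them into the shared geometry.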

📝 Abstract
We present T2Bs, a framework for generating high-quality, animatable character head morphable models from text by combining static text-to-3D generation with video diffusion. Text-to-3D models produce detailed static geometry but lack motion synthesis, while video diffusion models generate motion with temporal and multi-view geometric inconsistencies. T2Bs bridges this gap by leveraging deformable 3D Gaussian splatting to align static 3D assets with video outputs. By constraining motion with static geometry and employing a view-dependent deformation MLP, T2Bs (i) outperforms existing 4D generation methods in accuracy and expressiveness while reducing video artifacts and view inconsistencies, and (ii) reconstructs smooth, coherent, fully registered 3D geometries designed to scale for building morphable models with diverse, realistic facial motions. This enables synthesizing expressive, animatable character heads that surpass current 4D generation techniques.
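Since the end product is a morphable (blendshape) head model, it helps to recall how such a model is animated: a neutral shape plus a weighted sum of per-expression vertex offsets. The sketch below shows this standard formulation; the tensor shapes and data are placeholders, not T2Bs outputs.

```python
import torch

def blend(neutral: torch.Tensor, deltas: torch.Tensor,
          weights: torch.Tensor) -> torch.Tensor:
    """Standard blendshape animation: neutral + sum_k w_k * delta_k.

    neutral: (V, 3) vertices; deltas: (K, V, 3) expression offsets;
    weights: (K,) blend weights for one animation frame.
    """
    return neutral + torch.einsum("k,kvc->vc", weights, deltas)

neutral = torch.zeros(5_000, 3)                 # placeholder head mesh
deltas = torch.randn(10, 5_000, 3)              # 10 expression blendshapes
frame = blend(neutral, deltas, torch.rand(10))  # one animated frame
```

"Fully registered" matters here: every reconstructed expression must share the same vertex (or Gaussian) correspondence so the offsets can be blended this way.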
Problem

Research questions and friction points this paper is trying to address.

Generating animatable 3D character heads from text
Bridging static 3D geometry with motion synthesis
Reducing artifacts and inconsistencies in 4D generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combining static text-to-3D generation with video diffusion priors
Aligning the static 3D asset to video outputs via deformable 3D Gaussian splatting
Employing a view-dependent deformation MLP (a sketch of the combined training objective follows this list)
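The interplay of these three ideas can be summarized as a joint objective: render the deformed Gaussians, match the renders to video frames, and regularize the deformation toward the static geometry and toward temporal smoothness. The sketch below is a hedged guess at such an objective; the loss names, terms, and weights are assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def t2bs_style_losses(rendered, video_frame,
                      deformed_t, deformed_prev, canonical):
    """Hypothetical combination of the losses described above.

    - photometric: match renders of deformed Gaussians to video frames
    - rigidity: keep deformed centers close to the static geometry
    - temporal: penalize frame-to-frame jitter in Gaussian centers
    """
    photometric = F.l1_loss(rendered, video_frame)
    rigidity = (deformed_t - canonical).pow(2).mean()
    temporal = (deformed_t - deformed_prev).pow(2).mean()
    # Weights are assumed values, not taken from the paper.
    return photometric + 0.1 * rigidity + 0.05 * temporal

# Example with dummy tensors (shapes are illustrative).
img = torch.rand(3, 256, 256)
pts = torch.randn(1_000, 3)
loss = t2bs_style_losses(img, torch.rand(3, 256, 256),
                         pts + 0.01, pts, pts)
```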
👥 Authors
Jiahao Luo (University of California, Santa Cruz)
Chaoyang Wang (Snap Inc.)
Michael Vasilkovsky (Snap Inc.)
Vladislav Shakhrai (Snap Inc.)
Di Liu (Rutgers University)
Peiye Zhuang (Snap Inc.)
Sergey Tulyakov (Director of Research, Snap Inc.)
Peter Wonka (King Abdullah University of Science and Technology, KAUST)
Hsin-Ying Lee (stealth mode startup)
James Davis (University of California, Santa Cruz)
Jian Wang (Snap Inc.)