ViSA: 3D-Aware Video Shading for Real-Time Upper-Body Avatar Creation

📅 2025-12-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address critical challenges in single-image-driven upper-body 3D avatar generation, including texture blurriness, motion rigidity, structural instability, and identity drift, this paper proposes a geometry-guided real-time autoregressive video diffusion framework. Methodologically, it introduces 3D reconstruction priors (e.g., SMPL-X pose parameters and UV texture maps) as strong geometric and appearance constraints that jointly guide a lightweight video diffusion model for frame-wise shading and dynamic synthesis. A geometry-aware spatiotemporal attention mechanism and a prior-driven denoising process keep the skeletal structure and identity stable while significantly enhancing high-frequency texture fidelity and temporal motion naturalness. Extensive experiments demonstrate that the approach outperforms state-of-the-art methods in visual quality, temporal coherence, and inference speed, achieving real-time rendering at 60 FPS and making it suitable for interactive applications such as gaming and VR.
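
To make the pipeline concrete, the following is a minimal PyTorch sketch of the prior-guided autoregressive loop, assuming each frame's coarse prior render (an SMPL-X rasterization textured with the reconstructed UV map) conditions a few-step denoiser together with the previously generated frame. All names (ShadingDenoiser, render_stream, prior_frames) are illustrative rather than the authors' released code, and the toy DDIM-style schedule stands in for whatever sampler the paper actually uses.

```python
import torch
import torch.nn as nn

class ShadingDenoiser(nn.Module):
    """Toy stand-in for the lightweight video diffusion model: predicts the
    noise in the current frame, conditioned on the geometry/appearance prior
    render and the previously generated frame (autoregressive context)."""
    def __init__(self, ch: int = 64):
        super().__init__()
        # 3 channels each for noisy frame, prior render, previous frame
        self.net = nn.Sequential(
            nn.Conv2d(9, ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(ch, 3, 3, padding=1),
        )

    def forward(self, x_t, prior, prev):
        return self.net(torch.cat([x_t, prior, prev], dim=1))

@torch.no_grad()
def render_stream(denoiser, prior_frames, num_steps: int = 4):
    """Shade a stream of coarse prior renders frame by frame."""
    a_bars = torch.linspace(0.1, 0.99, num_steps + 1)  # toy alpha-bar schedule
    frames = []
    prev = torch.zeros_like(prior_frames[0])
    for prior in prior_frames:
        x = torch.randn_like(prior)  # start each frame from noise
        for i in range(num_steps):
            a_t, a_next = a_bars[i], a_bars[i + 1]
            eps = denoiser(x, prior, prev)
            x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # estimate clean frame
            x = a_next.sqrt() * x0 + (1 - a_next).sqrt() * eps  # deterministic DDIM step
        prev = x  # feed back for temporal coherence and identity stability
        frames.append(x)
    return torch.stack(frames)

# Usage: shade 8 frames of 64x64 prior renders
denoiser = ShadingDenoiser()
priors = [torch.rand(1, 3, 64, 64) for _ in range(8)]
video = render_stream(denoiser, priors)  # (8, 1, 3, 64, 64)
```

The autoregressive feedback of `prev` is what the summary credits for temporal coherence, and running only a handful of denoising steps per frame is what makes a real-time budget plausible.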

📝 Abstract
Generating high-fidelity upper-body 3D avatars from a single input image remains a significant challenge. Current 3D avatar generation methods, which rely on large reconstruction models, are fast and capable of producing stable body structures, but they often suffer from artifacts such as blurry textures and stiff, unnatural motion. In contrast, generative video models show promising performance by synthesizing photorealistic and dynamic results, but they frequently struggle with unstable behavior, including body structural errors and identity drift. To address these limitations, we propose a novel approach that combines the strengths of both paradigms. Our framework employs a 3D reconstruction model to provide robust structural and appearance priors, which in turn guide a real-time autoregressive video diffusion model for rendering. This process enables the model to synthesize high-frequency, photorealistic details and fluid dynamics in real time, effectively reducing texture blur and motion stiffness while preventing the structural inconsistencies common in video generation methods. By uniting the geometric stability of 3D reconstruction with the generative capabilities of video models, our method produces high-fidelity digital avatars with realistic appearance and dynamic, temporally coherent motion. Experiments demonstrate that our approach significantly reduces artifacts and achieves substantial improvements in visual quality over leading methods, providing a robust and efficient solution for real-time applications such as gaming and virtual reality. Project page: https://lhyfst.github.io/visa
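
On the structural-prior side, SMPL-X is a parametric body model, so the reconstruction stage amounts to fitting its pose and shape parameters plus a UV texture to the input image, then rasterizing the posed, textured mesh into the coarse render for each frame. Below is a minimal sketch of the posing step using the real `smplx` package; it assumes the SMPL-X model files have been downloaded to a local `models/` folder, and the image-fitting and rasterization stages (which the paper's reconstruction model would handle) are omitted.

```python
import torch
import smplx  # pip install smplx; model files are downloaded separately

# Load the neutral SMPL-X body model (expects models/smplx/SMPLX_NEUTRAL.npz)
model = smplx.create("models", model_type="smplx", gender="neutral", use_pca=False)

# Pose parameters would come from the reconstruction model; zeros = rest pose
body_pose = torch.zeros(1, model.NUM_BODY_JOINTS * 3)  # axis-angle, (1, 63)
output = model(body_pose=body_pose, return_verts=True)

vertices = output.vertices  # (1, 10475, 3): posed mesh, ready to be rasterized
# with the reconstructed UV texture into the per-frame prior render
```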
Problem

Research questions and friction points this paper is trying to address.

Generating high-fidelity 3D avatars from a single image
Reducing texture blur and motion stiffness in avatars
Preventing structural errors and identity drift in generated video
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines a 3D reconstruction model with a real-time video diffusion model
Uses 3D structural and appearance priors to guide autoregressive rendering
Unites geometric stability with generative video capabilities (see the attention sketch after this list)
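
To make the "geometric stability meets generative video" idea concrete, here is one plausible shape for the geometry-aware spatiotemporal attention mentioned in the summary: geometry features (e.g., encodings of the SMPL-X render) bias the queries and keys so that tokens belonging to the same body part attend to each other across frames, while the values carry appearance. This is a hypothetical sketch rather than the paper's actual layer; `GeometryAwareAttention` and its interface are invented for illustration.

```python
import torch
import torch.nn as nn

class GeometryAwareAttention(nn.Module):
    """Spatiotemporal self-attention over all frame tokens, with geometry
    features added to queries/keys so attention tracks body structure."""
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.geo_proj = nn.Linear(dim, dim)  # embed the geometry prior
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, geo):
        # x:   (B, T*HW, dim) appearance tokens from all frames
        # geo: (B, T*HW, dim) per-token geometry features (pose/UV encoding)
        g = self.geo_proj(geo)
        q = k = x + g                # geometry-conditioned queries and keys
        out, _ = self.attn(q, k, x)  # values stay appearance-only
        return x + out               # residual connection

# Usage: 8 frames of 16x16 tokens, 128-dim features
layer = GeometryAwareAttention(dim=128)
tokens = torch.randn(1, 8 * 256, 128)
geo = torch.randn(1, 8 * 256, 128)
out = layer(tokens, geo)  # (1, 2048, 128)
```

Biasing only Q and K (not V) is one common way to let a structural signal steer where attention looks without letting it overwrite appearance content, which matches the stated goal of keeping identity stable while the diffusion model supplies high-frequency detail.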
🔎 Similar Papers
2024-07-21 · IEEE Transactions on Pattern Analysis and Machine Intelligence · Citations: 7