🤖 AI Summary
Existing approaches exhibit a fundamental trade-off: radiance-field-based personalized avatars require multi-view video input and lack cross-identity generalization, while 2D diffusion-based methods offer broader applicability but suffer from low rendering fidelity and insufficient pose-dependent detail modeling (e.g., cloth wrinkles). To address these limitations, we propose the first two-stage diffusion framework operating directly in neural-network weight space. Our method first optimizes a set of person-specific UNets and then trains a hyper-diffusion model over their weights, integrating neural radiance field rendering with pre-trained diffusion priors. The framework enables cross-identity, real-time, pose-controllable, high-fidelity avatar generation. Evaluated on a large-scale cross-identity dataset, it achieves significant improvements in visual realism, cloth-wrinkle accuracy, and inference efficiency, outperforming state-of-the-art methods across all key metrics.
📝 Abstract
Creating human avatars is a highly desirable yet challenging task. Recent advancements in radiance field rendering have achieved unprecedented photorealism and real-time performance for personalized dynamic human avatars. However, these approaches are typically restricted to person-specific rendering models trained on multi-view video of a single individual, limiting their ability to generalize across different identities. On the other hand, generative approaches leveraging prior knowledge from pre-trained 2D diffusion models produce static, often cartoonish human avatars that are animated only through simple skeleton-based articulation. As a result, the avatars generated by these methods suffer from lower rendering quality than person-specific rendering methods and fail to capture pose-dependent deformations such as cloth wrinkles. In this paper, we propose a novel approach that unites the strengths of person-specific rendering and diffusion-based generative modeling to enable dynamic human avatar generation with both high photorealism and realistic pose-dependent deformations. Our method follows a two-stage pipeline: first, we optimize a set of person-specific UNets, with each network representing a dynamic human avatar that captures intricate pose-dependent deformations. In the second stage, we train a hyper-diffusion model over the optimized network weights. During inference, our method generates network weights for real-time, controllable rendering of dynamic human avatars. Using a large-scale, cross-identity, multi-view video dataset, we demonstrate that our approach outperforms state-of-the-art human avatar generation methods.
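To make the two-stage pipeline concrete, here is a minimal toy sketch: stage one fits a per-identity weight vector (standing in for a person-specific UNet), and stage two trains a denoiser over the collection of optimized weight vectors (standing in for the hyper-diffusion model). All names, dimensions, and the linear denoiser are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Stage 1 (toy): per-identity weight optimization ------------------
# Each avatar is a weight vector theta_i. A least-squares fit per identity
# stands in for training a person-specific UNet on multi-view video.
theta_dim, n_identities = 16, 8
targets = rng.normal(size=(n_identities, theta_dim))  # stand-in "capture data"

def optimize_identity(target, steps=200, lr=0.1):
    theta = np.zeros(theta_dim)
    for _ in range(steps):
        grad = theta - target          # gradient of 0.5 * ||theta - target||^2
        theta -= lr * grad
    return theta

thetas = np.stack([optimize_identity(t) for t in targets])

# --- Stage 2 (toy): diffusion over the optimized weights --------------
# A linear model is trained to predict the noise added to each theta,
# mimicking the epsilon-prediction objective of a weight-space diffusion
# model (one fixed noise level here, for brevity).
W = np.zeros((theta_dim, theta_dim))
for _ in range(500):
    eps = rng.normal(size=thetas.shape)
    noisy = thetas + eps
    pred = noisy @ W                   # predicted noise
    grad = 2 * noisy.T @ (pred - eps) / n_identities
    W -= 0.01 * grad

# Inference: denoise a random vector into a new weight vector, which
# would then parameterize a renderable avatar.
sample = rng.normal(size=theta_dim)
generated_theta = sample - sample @ W  # single denoising step (illustrative)
```

In the real system each `theta` would be the full weight tensor of a UNet and the denoiser a learned diffusion model, but the data flow (optimize per identity, diffuse over the resulting weights, sample new weights at inference) is the same.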