AI Summary
This work addresses single-image front-facing portrait synthesis, which is often hindered by insufficient geometric understanding, distortions in facial and hand details, and the difficulty of achieving real-time inference. To overcome these limitations, we propose the PrismMirror framework, which integrates cascaded coarse-to-fine geometry learning based on SMPL-X meshes and point clouds, rendering-supervised texture refinement, and knowledge distillation into a lightweight linear attention model, all without relying on external geometric priors. Our method achieves photorealistic reconstruction with high efficiency, and is the first to enable real-time performance (24 FPS) for monocular front-view portrait synthesis while significantly outperforming existing methods in both visual fidelity and structural accuracy.
Abstract
Photorealistic human novel view synthesis from a single image is crucial for democratizing immersive 3D telepresence, as it eliminates the need for complex multi-camera setups. However, current rendering-centric methods prioritize visual fidelity over explicit geometric understanding and struggle with intricate regions such as faces and hands, leading to temporal instability. Meanwhile, human-centric frameworks suffer from memory bottlenecks because they typically rely on an auxiliary model to supply structural priors for geometric modeling, which limits real-time performance. To address these challenges, we propose PrismMirror, a geometry-guided framework for instant frontal view synthesis from a single image. By avoiding external geometric modeling and focusing on frontal view synthesis, our model preserves visual integrity for telepresence. Specifically, PrismMirror introduces a novel cascade learning strategy for coarse-to-fine geometric feature learning: it first learns coarse geometric features, such as SMPL-X meshes and point clouds, and then refines textures through rendering supervision. To achieve real-time efficiency, we distill this unified framework into a lightweight linear attention model. Notably, PrismMirror is the first monocular human frontal view synthesis model to achieve real-time inference at 24 FPS, significantly outperforming previous methods in both visual authenticity and structural accuracy.