🤖 AI Summary
To address the challenge of real-time inference of Photorealistic Codec Avatars (PCAs) on resource-constrained VR devices, this paper proposes ESCA, an algorithm–hardware co-optimization framework. Methodologically, ESCA introduces a dedicated post-training low-bit quantization scheme tailored for PCA models, integrated with a customized hardware accelerator and perceptually guided quality evaluation using FovVideoVDP. In terms of contributions and results, ESCA achieves the first full-stack optimization of PCAs, balancing high fidelity with high efficiency. Compared to the best 4-bit baseline, it improves the FovVideoVDP quality score by up to 0.39 and reduces inference latency by up to 3.36×. End-to-end measurements demonstrate sustained throughput of 100 fps, satisfying the stringent low-latency and high-frame-rate requirements of immersive VR interaction.
📝 Abstract
Photorealistic Codec Avatars (PCA), which generate high-fidelity human face renderings, are increasingly being used in Virtual Reality (VR) environments to enable immersive communication and interaction through deep learning-based generative models. However, these models impose significant computational demands, making real-time inference challenging on resource-constrained VR devices such as head-mounted displays, where latency and power efficiency are critical. To address this challenge, we propose an efficient post-training quantization (PTQ) method tailored for Codec Avatar models, enabling low-precision execution without compromising output quality. In addition, we design a custom hardware accelerator that can be integrated into the system-on-chip of VR devices to further enhance processing efficiency. Building on these components, we introduce ESCA, a full-stack optimization framework that accelerates PCA inference on edge VR platforms. Experimental results demonstrate that ESCA boosts FovVideoVDP quality scores by up to $+0.39$ over the best 4-bit baseline, delivers up to $3.36\times$ latency reduction, and sustains a rendering rate of 100 frames per second in end-to-end tests, satisfying real-time VR requirements. These results demonstrate the feasibility of deploying high-fidelity codec avatars on resource-constrained devices, opening the door to more immersive and portable VR experiences.
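To give a concrete sense of what post-training low-bit quantization involves, the sketch below shows a generic symmetric 4-bit PTQ routine in NumPy. This is an illustrative example only, not the paper's actual scheme: the function names (`quantize_4bit`, `dequantize`) and the per-channel scaling choice are assumptions for illustration; the paper's method is tailored to PCA models and paired with a custom accelerator.

```python
import numpy as np

def quantize_4bit(w, per_channel_axis=None):
    """Symmetric post-training quantization to the signed 4-bit range [-8, 7].

    Illustrative sketch, not the paper's scheme. If per_channel_axis is given,
    a separate scale is computed per slice along that axis, which typically
    reduces quantization error for weight tensors.
    """
    qmax = 7
    if per_channel_axis is None:
        scale = np.max(np.abs(w)) / qmax
    else:
        axes = tuple(i for i in range(w.ndim) if i != per_channel_axis)
        scale = np.max(np.abs(w), axis=axes, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from quantized values and scales."""
    return q.astype(np.float32) * scale

# Example: quantize a small weight matrix per output channel (axis 0).
w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_4bit(w, per_channel_axis=0)
w_hat = dequantize(q, s)
# With symmetric rounding, per-element error is bounded by half the scale.
max_err = np.max(np.abs(w - w_hat))
```

In a real deployment the int4 values would be packed two-per-byte and consumed directly by low-precision multiply-accumulate units, which is where the hardware accelerator described above comes in.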