ESCA: Enabling Seamless Codec Avatar Execution through Algorithm and Hardware Co-Optimization for Virtual Reality

📅 2025-10-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of real-time inference of Photorealistic Codec Avatars (PCAs) on resource-constrained VR devices, this paper proposes ESCA, an algorithm–hardware co-optimization framework. Methodologically, ESCA introduces a dedicated post-training low-bit quantization scheme tailored to PCA models, integrated with a customized hardware accelerator and perceptually guided quality evaluation using FovVideoVDP. ESCA delivers a full-stack optimization of PCAs that balances high fidelity with high efficiency: compared to the best 4-bit baseline, it improves FovVideoVDP quality scores by up to 0.39 and reduces inference latency by up to 3.36×. End-to-end measurements demonstrate a sustained rendering rate of 100 frames per second, satisfying the stringent low-latency, high-frame-rate requirements of immersive VR interaction.

📝 Abstract
Photorealistic Codec Avatars (PCA), which generate high-fidelity human face renderings, are increasingly being used in Virtual Reality (VR) environments to enable immersive communication and interaction through deep learning-based generative models. However, these models impose significant computational demands, making real-time inference challenging on resource-constrained VR devices such as head-mounted displays, where latency and power efficiency are critical. To address this challenge, we propose an efficient post-training quantization (PTQ) method tailored for Codec Avatar models, enabling low-precision execution without compromising output quality. In addition, we design a custom hardware accelerator that can be integrated into the system-on-chip of VR devices to further enhance processing efficiency. Building on these components, we introduce ESCA, a full-stack optimization framework that accelerates PCA inference on edge VR platforms. Experimental results demonstrate that ESCA boosts FovVideoVDP quality scores by up to $+0.39$ over the best 4-bit baseline, delivers up to $3.36\times$ latency reduction, and sustains a rendering rate of 100 frames per second in end-to-end tests, satisfying real-time VR requirements. These results demonstrate the feasibility of deploying high-fidelity codec avatars on resource-constrained devices, opening the door to more immersive and portable VR experiences.
Problem

Research questions and friction points this paper is trying to address.

Reducing computational demands of photorealistic avatars on VR devices
Enabling real-time inference for codec avatars with limited resources
Optimizing hardware and algorithms for efficient VR avatar execution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Post-training quantization for low-precision avatar execution
Custom hardware accelerator for VR system-on-chip integration
Full-stack optimization framework for real-time edge inference
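The first bullet, post-training quantization, can be illustrated with a generic sketch. The symmetric per-tensor scheme below is an assumption for illustration only; the paper's actual PTQ method is tailored to PCA models and is not reproduced here.

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int = 4):
    """Symmetric per-tensor post-training quantization (generic sketch).

    Maps float weights onto signed integer codes in [-(2^(b-1)-1), 2^(b-1)-1]
    and returns the codes plus the scale needed to dequantize. No retraining
    is involved, which is what makes this "post-training" quantization.
    """
    qmax = 2 ** (bits - 1) - 1                 # e.g. 7 for 4-bit
    scale = float(np.abs(w).max()) / qmax      # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from integer codes."""
    return q.astype(np.float32) * scale

# Quantize a toy weight matrix and inspect the reconstruction error.
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)
q, s = quantize_symmetric(w, bits=4)
err = float(np.abs(w - dequantize(q, s)).mean())
print(f"mean abs error at 4 bits: {err:.4f}")
```

Per-tensor scaling is the simplest variant; production PTQ schemes typically use per-channel scales and calibration data to reduce the error further.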
Mingzhi Zhu
Tandon School of Engineering, New York University
Ding Shang
Tandon School of Engineering, New York University
Sai Qian Zhang
New York University