AI Summary
This work proposes Mon3tr, a framework for mobile immersive telepresence. Existing systems typically rely on multi-camera setups and high-bandwidth volumetric streaming, which hinders real-time performance on mobile devices. Mon3tr is the first to integrate 3D Gaussian Splatting (3DGS) based parametric human modeling into telepresence, and it pairs this with an amortized computation strategy: a user-specific 3DGS avatar is built once offline from multi-view captures, and during online operation only a monocular RGB video stream is needed to drive body motion and facial expression synthesis in real time. This drastically reduces hardware and bandwidth demands: rendering runs at approximately 60 FPS on devices such as the Meta Quest 3, with an end-to-end latency of about 80 ms and bandwidth consumption below 0.2 Mbps, over 1000× lower than point-cloud streaming. The method attains a PSNR exceeding 28 dB under novel poses, enabling high-quality, low-overhead mobile 3D telepresence.
Abstract
Immersive telepresence aims to transform human interaction in AR/VR applications by enabling lifelike full-body holographic representations for enhanced remote collaboration. However, existing systems rely on hardware-intensive multi-camera setups and demand high bandwidth for volumetric streaming, limiting their real-time performance on mobile devices. To overcome these challenges, we propose Mon3tr, a novel Monocular 3D telepresence framework that integrates 3D Gaussian splatting (3DGS) based parametric human modeling into telepresence for the first time. Mon3tr adopts an amortized computation strategy, dividing the process into a one-time offline multi-view reconstruction phase to build a user-specific avatar and a monocular online inference phase during live telepresence sessions. A single monocular RGB camera is used to capture body motions and facial expressions in real time to drive the 3DGS-based parametric human model, significantly reducing system complexity and cost. The extracted motion and appearance features are transmitted at <0.2 Mbps over WebRTC's data channel, allowing robust adaptation to network fluctuations. On the receiver side, e.g., Meta Quest 3, we develop a lightweight 3DGS attribute deformation network to dynamically generate corrective 3DGS attribute adjustments on the pre-built avatar, synthesizing photorealistic motion and appearance at ~60 FPS. Extensive experiments demonstrate the state-of-the-art performance of our method, achieving a PSNR of >28 dB for novel poses, an end-to-end latency of ~80 ms, and >1000x bandwidth reduction compared to point-cloud streaming, while supporting real-time operation from monocular inputs across diverse scenarios. Our demos can be found at https://mon3tr3d.github.io.
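To give a feel for what the reported bandwidth figure implies, the following back-of-the-envelope sketch converts the <0.2 Mbps rate into an average per-frame payload budget. It rests on an assumption the abstract does not state: that one feature packet is sent per rendered frame, so the sender runs at the same ~60 Hz as the renderer. The helper names `bytes_per_frame` and `implied_baseline_mbps` are illustrative only and are not part of Mon3tr. The baseline figure takes both reported thresholds (0.2 Mbps and 1000x) at face value, so it is an order-of-magnitude estimate rather than a measured number.

```python
# Back-of-the-envelope budget implied by the figures quoted in the abstract.
# Assumption (not from the paper): one feature packet per rendered frame.

BITS_PER_MEGABIT = 1_000_000


def bytes_per_frame(bitrate_mbps: float, fps: float) -> float:
    """Average payload budget per frame, in bytes, at a given bitrate."""
    return bitrate_mbps * BITS_PER_MEGABIT / 8 / fps


def implied_baseline_mbps(bitrate_mbps: float, reduction_factor: float) -> float:
    """Bitrate of a baseline stream that is `reduction_factor` times heavier."""
    return bitrate_mbps * reduction_factor


# 0.2 Mbps is an upper bound on the reported rate, so this is a ceiling.
budget = bytes_per_frame(0.2, 60)

# Both inputs are thresholds ("<0.2 Mbps", ">1000x"), so treat the result
# as a rough order of magnitude for point-cloud streaming, not a bound.
baseline = implied_baseline_mbps(0.2, 1000)

print(f"per-frame budget: about {budget:.0f} bytes or less")
print(f"implied point-cloud stream: on the order of {baseline:.0f} Mbps")
```

Under these assumptions the budget is a few hundred bytes per frame, which is consistent with sending compact motion and appearance features rather than geometry.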