JOintGS: Joint Optimization of Cameras, Bodies and 3D Gaussians for In-the-Wild Monocular Reconstruction

📅 2026-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses high-fidelity animatable 3D human reconstruction from unconstrained monocular videos, where inaccurate camera parameters and pose estimates severely hinder performance. The authors propose a unified framework that explicitly decouples foreground and background representations in order to jointly optimize camera extrinsics, human poses, and a 3D Gaussian scene representation in a co-refinement manner. By integrating temporal dynamics modeling, a residual color field, multi-view consistency constraints, and illumination-aware appearance modeling, the method is substantially more robust to noisy initializations. Experiments on the NeuMan and EMDB datasets show a 2.1 dB PSNR improvement over state-of-the-art approaches on NeuMan, while enabling real-time rendering and exhibiting greater tolerance to initial estimation errors.
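
The co-refinement idea amounts to treating per-frame camera extrinsics, per-frame body poses, and the Gaussian parameters as one set of learnable variables and letting photometric gradients flow into all of them. The snippet below is a minimal PyTorch sketch of that setup under assumed shapes and a placeholder `render` function; it is not the authors' implementation, which uses SMPL-based skinning and 3D Gaussian rasterization.

```python
import torch

T = 100                                                        # number of video frames (dummy)
camera_extrinsics = torch.nn.Parameter(torch.zeros(T, 6))      # per-frame se(3) refinements
body_poses        = torch.nn.Parameter(torch.zeros(T, 72))     # per-frame SMPL pose residuals
fg_means          = torch.nn.Parameter(torch.randn(5_000, 3))  # canonical human Gaussian centers
bg_means          = torch.nn.Parameter(torch.randn(20_000, 3)) # static background Gaussian centers

optimizer = torch.optim.Adam([
    {"params": [camera_extrinsics, body_poses], "lr": 1e-4},   # slow refinement of cameras/poses
    {"params": [fg_means, bg_means],            "lr": 1e-3},   # faster Gaussian updates
])

def render(fg, bg, cam, pose):
    """Placeholder for SMPL posing + Gaussian rasterization; returns a fake image."""
    return (fg.mean() + bg.mean() + cam.sum() + pose.sum()).expand(3, 64, 64)

frames = torch.rand(T, 3, 64, 64)                              # dummy target video

for step in range(1000):
    t = torch.randint(0, T, (1,)).item()
    pred = render(fg_means, bg_means, camera_extrinsics[t], body_poses[t])
    loss = torch.nn.functional.l1_loss(pred, frames[t])        # photometric term
    optimizer.zero_grad()
    loss.backward()    # gradients reach cameras, poses, and Gaussians alike
    optimizer.step()
```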

📝 Abstract
Reconstructing high-fidelity animatable 3D human avatars from monocular RGB videos remains challenging, particularly in unconstrained in-the-wild scenarios where camera parameters and human poses from off-the-shelf methods (e.g., COLMAP, HMR2.0) are often inaccurate. While recent 3D Gaussian Splatting (3DGS) advances demonstrate impressive rendering quality and real-time performance, they critically depend on precise camera calibration and pose annotations, limiting their applicability in real-world settings. We present JOintGS, a unified framework that jointly optimizes camera extrinsics, human poses, and 3D Gaussian representations from coarse initialization through a synergistic refinement mechanism. Our key insight is that explicit foreground-background disentanglement enables mutual reinforcement: static background Gaussians anchor camera estimation via multi-view consistency; refined cameras improve human body alignment through accurate temporal correspondence; and optimized human poses enhance scene reconstruction by removing dynamic artifacts from static constraints. We further introduce a temporal dynamics module to capture fine-grained pose-dependent deformations and a residual color field to model illumination variations. Extensive experiments on the NeuMan and EMDB datasets demonstrate that JOintGS achieves superior reconstruction quality, with a 2.1 dB PSNR improvement over state-of-the-art methods on the NeuMan dataset, while maintaining real-time rendering. Notably, our method shows significantly enhanced robustness to noisy initialization compared to the baseline. Our source code is available at https://github.com/MiliLab/JOintGS.
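
One common way to realize an illumination-aware residual color field of the kind the abstract describes is a small MLP that maps a Gaussian's position and a per-frame latent code to an RGB offset added to the base Gaussian color. The sketch below shows that pattern under assumed dimensions and architecture (`ResidualColorField`, 16-dim frame latents); the paper's exact design may differ.

```python
import torch
import torch.nn as nn

class ResidualColorField(nn.Module):
    """Small MLP: (Gaussian position, per-frame latent) -> RGB residual."""
    def __init__(self, latent_dim: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + latent_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 3),
        )

    def forward(self, xyz: torch.Tensor, frame_latent: torch.Tensor) -> torch.Tensor:
        z = frame_latent.expand(xyz.shape[0], -1)               # broadcast latent to all Gaussians
        return 0.1 * torch.tanh(self.mlp(torch.cat([xyz, z], dim=-1)))  # keep residuals small

# Usage: add the residual to the base Gaussian colors before rasterizing frame t.
field = ResidualColorField()
frame_latents = nn.Parameter(torch.zeros(100, 16))              # one learnable code per frame
xyz      = torch.randn(5_000, 3)                                # Gaussian centers
base_rgb = torch.rand(5_000, 3)                                 # view-independent base colors
rgb_t = (base_rgb + field(xyz, frame_latents[7])).clamp(0.0, 1.0)  # appearance for frame 7
```
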
Problem

Research questions and friction points this paper is trying to address.

monocular reconstruction
3D human avatars
in-the-wild
camera calibration
pose estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint Optimization
3D Gaussian Splatting
Monocular Reconstruction
Foreground-Background Disentanglement
Temporal Dynamics