🤖 AI Summary
3D avatar reconstruction has long suffered from high computational cost, sensitivity to input quality, and low data utilization. This paper proposes FastAvatar, a unified, fast, and high-fidelity feedforward framework for reconstructing 3D avatars from a single image, multi-view images, or monocular video. Built on a Large Gaussian Reconstruction Transformer (LGR-Transformer), it combines a VGGT-style backbone variant, multi-granularity guidance encoding, and an incremental Gaussian aggregation mechanism that lets reconstruction quality improve as more observations arrive. Camera pose, FLAME expression, and head pose are encoded as guidance signals, while landmark tracking and sliced fusion losses supervise the aggregation, all grounded in a unified 3D Gaussian Splatting representation. Experiments show higher reconstruction quality than existing methods, inference within seconds, and a tunable quality-speed trade-off, which improves both practicality and data efficiency.
📝 Abstract
Despite significant progress, 3D avatar reconstruction still faces challenges such as high time complexity, sensitivity to data quality, and low data utilization. We propose FastAvatar, a feedforward 3D avatar framework that flexibly leverages diverse daily recordings (e.g., a single image, multi-view observations, or monocular video) to reconstruct a high-quality 3D Gaussian Splatting (3DGS) model within seconds, using only a single unified model. FastAvatar's core is a Large Gaussian Reconstruction Transformer featuring three key designs: first, a variant of the VGGT-style transformer architecture that aggregates multi-frame cues while injecting an initial 3D prompt to predict an aggregatable canonical 3DGS representation; second, multi-granular guidance encoding (camera pose, FLAME expression, head pose) that mitigates animation-induced misalignment for variable-length inputs; third, incremental Gaussian aggregation via landmark tracking and sliced fusion losses. Integrating these features, FastAvatar enables incremental reconstruction, i.e., quality improves with more observations, unlike prior work, which wastes input data. This yields a quality-speed-tunable paradigm for highly usable avatar modeling. Extensive experiments show that FastAvatar achieves higher quality and highly competitive speed compared to existing methods.