🤖 AI Summary
This work addresses the challenge of representing and reconstructing 3D faces efficiently and faithfully, in dense semantic correspondence, by introducing the CUBE representation. CUBE replaces the 3D control points of traditional B-spline volumes with a lattice of high-dimensional learnable control features, increasing representational capacity while preserving local support. It reconstructs geometry through a continuous two-stage mapping: B-spline basis functions first blend the control features into a feature vector whose first three values give a base shape, and a lightweight MLP then predicts a residual displacement that refines it. Combined with transformer-based encoders and queried at points sampled from a fixed template mesh, CUBE achieves state-of-the-art scan registration results against recent baselines and is also demonstrated on monocular 3D face reconstruction, while supporting local editing and dense semantic correspondence across shapes.
📝 Abstract
We present CUBE (Control-based Unified B-spline Encoding), a new geometric representation for human faces that combines B-spline volumes with learned features, and demonstrate its use as a decoder for 3D scan registration and monocular 3D face reconstruction. Unlike existing B-spline representations with 3D control points, CUBE is parametrized by a lattice (e.g., 8 x 8 x 8) of high-dimensional control features, increasing the model's expressivity. These features define a continuous, two-stage mapping from a 3D parametric domain to 3D Euclidean space via an intermediate feature space. First, high-dimensional control features are locally blended using the B-spline bases, yielding a high-dimensional feature vector whose first three values define a 3D base mesh. A small MLP then processes this feature vector to predict a residual displacement from the base shape, yielding the final refined 3D coordinates. To reconstruct 3D surfaces in dense semantic correspondence, CUBE is queried at 3D coordinates sampled from a fixed template mesh. Crucially, CUBE retains the local support property of traditional B-spline representations, enabling local surface editing by updating individual control features. We demonstrate the strengths of this representation by training transformer-based encoders to predict CUBE's control features from unstructured point clouds and monocular images, achieving state-of-the-art scan registration results compared to recent baselines.
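The two-stage mapping described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`blend_features`, `mlp_residual`, `cube_decode`), the choice of uniform cubic B-splines, the feature width of 16, and the ReLU MLP are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np

def cubic_bspline_weights(t):
    """Uniform cubic B-spline basis weights for local parameter t in [0, 1].
    (Assumption: the abstract does not state the spline degree.)"""
    t2, t3 = t * t, t * t * t
    return np.stack([
        (1 - t) ** 3,
        3 * t3 - 6 * t2 + 4,
        -3 * t3 + 3 * t2 + 3 * t + 1,
        t3,
    ], axis=-1) / 6.0

def blend_features(control, uvw):
    """Stage 1: blend control features with the B-spline bases.
    control: (N, N, N, D) lattice of control features.
    uvw:     (P, 3) query points in the parametric domain [0, 1]^3.
    Returns  (P, D) blended feature vectors."""
    spans = control.shape[0] - 3                  # cubic: N - 3 spans per axis
    x = np.clip(uvw, 0.0, 1.0) * spans
    idx = np.minimum(np.floor(x).astype(int), spans - 1)   # span index, (P, 3)
    w = cubic_bspline_weights(x - idx)            # (P, 3, 4)
    offsets = np.arange(4)
    ix = idx[:, 0, None] + offsets                # (P, 4) control indices per axis
    iy = idx[:, 1, None] + offsets
    iz = idx[:, 2, None] + offsets
    # Gather the 4 x 4 x 4 neighbourhood that influences each query point.
    local = control[ix[:, :, None, None], iy[:, None, :, None], iz[:, None, None, :]]
    return np.einsum("pi,pj,pk,pijkd->pd", w[:, 0], w[:, 1], w[:, 2], local)

def mlp_residual(features, params):
    """Stage 2: small MLP (assumed ReLU) mapping features to a 3D displacement.
    params is a list of (weight, bias) pairs; the last layer outputs 3 values."""
    h = features
    for weight, bias in params[:-1]:
        h = np.maximum(h @ weight + bias, 0.0)
    weight, bias = params[-1]
    return h @ weight + bias

def cube_decode(control, uvw, params):
    """Return (refined points, base points), both of shape (P, 3)."""
    features = blend_features(control, uvw)
    base = features[:, :3]                        # first three values: base shape
    return base + mlp_residual(features, params), base

# Hypothetical usage: an 8 x 8 x 8 lattice of 16-dimensional control features,
# queried at coordinates that would normally come from a fixed template mesh.
rng = np.random.default_rng(0)
control = rng.normal(size=(8, 8, 8, 16))
params = [(0.1 * rng.normal(size=(16, 32)), np.zeros(32)),
          (0.1 * rng.normal(size=(32, 3)), np.zeros(3))]
template_uvw = rng.uniform(size=(5, 3))
points, base = cube_decode(control, template_uvw, params)
```

Because each query point gathers only a 4 x 4 x 4 neighbourhood of the lattice, changing one control feature moves only the surface points whose neighbourhood contains it. This is the local support property that the abstract credits for local editing.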