Registration-Free Learnable Multi-View Capture of Faces in Dense Semantic Correspondence

📅 2026-05-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes MOCHI, a learning-based framework for multi-view 3D face reconstruction that is trained end-to-end without manually registered training data. MOCHI enforces topological consistency with a pseudo-linear inverse kinematics solver, guides semantic alignment with a dense 2D landmark predictor trained solely on synthetic data, and replaces the conventional point-to-surface distance with pointmap- and normal-based losses to stabilize training. It also introduces a test-time optimization scheme that fine-tunes the network weights over a few dozen iterations, improving reconstruction accuracy without compromising efficiency. Experiments show that MOCHI surpasses traditional labor-intensive registration pipelines in both reconstruction accuracy and visual quality, while removing the reliance on real-world registered data.
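The test-time weight fine-tuning idea can be illustrated on a toy problem. The sketch below is not MOCHI's implementation: a small linear model stands in for the network, a mean squared error stands in for the paper's losses, and the function name `finetune_at_test_time`, the step count, and the learning rate are all hypothetical choices made for illustration.

```python
import numpy as np

def finetune_at_test_time(weights, features, target, steps=30, lr=0.1):
    """Refine a copy of the model weights on a single test sample.

    Assumption: a linear model (features @ weights) and an MSE objective
    stand in for the real network and its losses.
    """
    w = weights.copy()  # leave the trained weights untouched
    losses = []
    for _ in range(steps):  # "a few dozen iterations"
        residual = features @ w - target            # error on this sample, (N, 3)
        losses.append(float(np.mean(residual ** 2)))
        grad = 2.0 * features.T @ residual / residual.size  # d(MSE)/dw
        w -= lr * grad                              # plain gradient descent step
    return w, losses

# Toy data: 50 observations with 4 features each, predicting 3D points.
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 4))
true_w = rng.normal(size=(4, 3))
target = features @ true_w
init_w = true_w + 0.5 * rng.normal(size=(4, 3))     # imperfect feed-forward weights

refined_w, losses = finetune_at_test_time(init_w, features, target)
```

Here the feed-forward weights give a reasonable starting point, and a short optimization loop on the test sample reduces the remaining error. In a real setting the objective would be whatever self-supervised losses are available at test time.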

📝 Abstract

Recent frameworks like ToFu and TEMPEH provide an automated alternative to classical registration pipelines by predicting 3D meshes in dense semantic correspondence directly from calibrated multi-view images. However, these learning-based methods rely on the slow, manual registration pipelines they aim to replace for their training supervision. We overcome this limitation with MOCHI (Multi-view Optimizable Correspondence of Heads from Images), a multi-view 3D face prediction framework trained without requiring registered training data. MOCHI eliminates the registration data dependency by enforcing topological consistency through a pseudo-linear inverse kinematic solver. Semantic alignment is guided by dense keypoints from a 2D landmark predictor trained exclusively on synthetic data. Our analysis further reveals that standard point-to-surface distances induce training instabilities and visual artifacts in registration-free settings. We propose pointmap- and normal-based losses instead, which provide smoother gradients and superior reconstruction fidelity. Finally, we introduce a test-time optimization scheme that refines network weights over a few dozen iterations. This approach bridges the gap between feed-forward efficiency and iterative optimization precision, allowing MOCHI to outperform traditional labor-intensive pipelines in both reconstruction accuracy and visual quality. Code and model are public at: https://filby89.github.io/mochi.
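To make the pointmap- and normal-based losses more concrete, here is a minimal NumPy sketch. It assumes that a pointmap is an image-shaped array of shape (H, W, 3) storing one 3D surface point per pixel, which the abstract does not spell out. The function names, the L1 distance, the finite-difference normals, and the cosine penalty are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def normals_from_pointmap(pointmap):
    """Estimate unit normals from an (H, W, 3) pointmap via finite differences."""
    du = pointmap[:, 1:, :] - pointmap[:, :-1, :]   # horizontal differences, (H, W-1, 3)
    dv = pointmap[1:, :, :] - pointmap[:-1, :, :]   # vertical differences, (H-1, W, 3)
    n = np.cross(du[:-1], dv[:, :-1])               # tangent cross product, (H-1, W-1, 3)
    norm = np.linalg.norm(n, axis=-1, keepdims=True)
    return n / np.clip(norm, 1e-8, None)            # guard against zero-length normals

def pointmap_loss(pred, target, mask):
    """Mean per-pixel L1 distance between 3D points where mask is True."""
    diff = np.abs(pred - target).sum(axis=-1)
    return float((diff * mask).sum() / max(mask.sum(), 1))

def normal_loss(pred, target):
    """Mean (1 - cosine similarity) between the two normal maps."""
    cos = (normals_from_pointmap(pred) * normals_from_pointmap(target)).sum(axis=-1)
    return float(np.mean(1.0 - cos))

# Toy example: a flat target surface, a shifted copy, and a tilted copy.
ys, xs = np.meshgrid(np.linspace(0, 1, 8), np.linspace(0, 1, 8), indexing="ij")
target = np.stack([xs, ys, np.zeros_like(xs)], axis=-1)
shifted = target + np.array([0.0, 0.0, 0.1])        # same orientation, offset in z
tilted = np.stack([xs, ys, 0.5 * xs], axis=-1)      # different orientation
mask = np.ones(xs.shape, dtype=bool)

shift_point_loss = pointmap_loss(shifted, target, mask)
shift_normal_loss = normal_loss(shifted, target)
tilt_normal_loss = normal_loss(tilted, target)
```

In this toy case the shifted surface is penalized only by the pointmap term, while the tilted surface is also penalized by the normal term, so the two losses give complementary signals. Both compare values at fixed pixel locations, so neither needs a nearest-surface search.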

Problem

Research questions and friction points this paper is trying to address.

registration-free

multi-view face reconstruction

dense semantic correspondence

3D face prediction

training data dependency

Innovation

Methods, ideas, or system contributions that make the work stand out.

registration-free

dense semantic correspondence

multi-view 3D face reconstruction