Fin3R: Fine-tuning Feed-forward 3D Reconstruction Models via Monocular Knowledge Distillation

📅 2025-11-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing feed-forward 3D reconstruction models produce blurred geometric detail and lack robustness, owing to scarce high-fidelity depth and pose supervision and to the geometric misalignment inherent in multi-view pointmap regression. To address this, we propose Fin3R, a lightweight fine-tuning framework that improves the image encoder via monocular knowledge distillation: the pre-trained decoder is frozen, and Low-Rank Adaptation (LoRA) adapters are injected only into the encoder to distill fine-grained geometric priors from a monocular teacher model using large-scale unlabeled images. The approach requires no multi-view annotations and mitigates inter-view geometric inconsistency. Evaluated on DUSt3R, MASt3R, CUT3R, and VGGT, Fin3R sharpens boundaries, recovers complex structures, and improves geometric accuracy, with virtually no change in inference memory footprint or latency.

📝 Abstract

We present Fin3R, a simple, effective, and general fine-tuning method for feed-forward 3D reconstruction models. This family of models regresses the pointmaps of all input images into a reference-frame coordinate system, along with other auxiliary outputs, in a single forward pass. However, we find that current models struggle with fine geometry and robustness due to (i) the scarcity of high-fidelity depth and pose supervision and (ii) the inherent geometric misalignment from multi-view pointmap regression. Fin3R jointly tackles these two issues with an extra lightweight fine-tuning step. We freeze the decoder, which handles view matching, and fine-tune only the image encoder, the component dedicated to feature extraction. The encoder is enriched with fine geometric details distilled from a strong monocular teacher model on large, unlabeled datasets, using a custom, lightweight LoRA adapter. We validate our method on a wide range of models, including DUSt3R, MASt3R, CUT3R, and VGGT. The fine-tuned models consistently deliver sharper boundaries, recover complex structures, and achieve higher geometric accuracy in both single- and multi-view settings, while adding only the tiny LoRA weights, which leave test-time memory and latency virtually unchanged. Project page: https://visual-ai.github.io/fin3r

Problem

Research questions and friction points this paper is trying to address.

Improves fine geometric detail in feed-forward 3D reconstruction models

Addresses scarcity of high-fidelity depth and pose supervision

Resolves geometric misalignment from multi-view pointmap regression

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tunes encoder with monocular teacher knowledge distillation

Uses lightweight LoRA adapter for efficiency

Freezes decoder to preserve view matching functionality
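
The three ideas above (a frozen pre-trained weight, a small trainable low-rank update, and a loss that pulls the student toward a teacher) can be illustrated with a minimal sketch of a standard LoRA layer in plain Python. This is not the paper's custom adapter, whose details are not given here: the class name LoRALinear, the alpha/rank scaling, the zero initialization of B, and the mean-squared-error distillation loss are all assumptions chosen for illustration.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


class LoRALinear:
    """Hypothetical layer: frozen weight W (d_out x d_in) plus a low-rank update B @ A."""

    def __init__(self, W, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W  # frozen: never updated during fine-tuning
        # Standard LoRA init (assumed here): A is small random, B is zero,
        # so the adapted layer starts out identical to the pre-trained one.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def merged_weight(self):
        """Fold the update into W so inference uses a single matrix."""
        rank = len(self.A)
        return [
            [w + self.scale * sum(self.B[i][k] * self.A[k][j] for k in range(rank))
             for j, w in enumerate(row)]
            for i, row in enumerate(self.W)
        ]

    def trainable_params(self):
        """Count only the adapter entries; W is frozen."""
        rank = len(self.A)
        return rank * (len(self.W) + len(self.W[0]))


def distill_loss(student, teacher):
    """Placeholder distillation loss: mean squared error to the teacher output."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)
```

In this sketch only A and B would receive gradient updates, so the trainable parameter count grows as rank × (d_in + d_out) rather than d_in × d_out. Folding the update into W via merged_weight is one common way a standard LoRA layer keeps inference cost unchanged; whether Fin3R merges its adapter this way is not stated in the summary.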

🔎 Similar Papers

MonoTAKD: Teaching Assistant Knowledge Distillation for Monocular 3D Object Detection