FF3R: Feedforward Feature 3D Reconstruction from Unconstrained Views

📅 2026-04-10
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a limitation of existing approaches, which typically handle 3D geometric reconstruction and semantic understanding separately and rely on strong supervisory signals such as camera poses, depth maps, or semantic labels, often producing redundant pipelines and accumulated errors. The authors propose an end-to-end feedforward framework that requires no annotations, using only multi-view RGB images and rendering-based supervision to jointly infer geometry and semantics. The method introduces a Token-wise Fusion Module that enriches geometric tokens with semantic context via cross-attention, and a Semantic-Geometry Mutual Boosting mechanism that co-optimizes global semantic consistency (through geometry-guided feature warping) and local structural coherence (through semantic-aware voxelization). It achieves state-of-the-art results on ScanNet and DL3DV-10K across novel view synthesis, open-vocabulary semantic segmentation, and depth estimation, and generalizes well to in-the-wild scenes.

πŸ“ Abstract
Recent advances in vision foundation models have revolutionized geometry reconstruction and semantic understanding. Yet, most of the existing approaches treat these capabilities in isolation, leading to redundant pipelines and compounded errors. This paper introduces FF3R, a fully annotation-free feed-forward framework that unifies geometric and semantic reasoning from unconstrained multi-view image sequences. Unlike previous methods, FF3R does not require camera poses, depth maps, or semantic labels, relying solely on rendering supervision for RGB and feature maps, establishing a scalable paradigm for unified 3D reasoning. In addition, we address two critical challenges in feedforward feature reconstruction pipelines, namely global semantic inconsistency and local structural inconsistency, through two key innovations: (i) a Token-wise Fusion Module that enriches geometry tokens with semantic context via cross-attention, and (ii) a Semantic-Geometry Mutual Boosting mechanism combining geometry-guided feature warping for global consistency with semantic-aware voxelization for local coherence. Extensive experiments on ScanNet and DL3DV-10K demonstrate FF3R's superior performance in novel-view synthesis, open-vocabulary semantic segmentation, and depth estimation, with strong generalization to in-the-wild scenarios, paving the way for embodied intelligence systems that demand both spatial and semantic understanding.
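The Token-wise Fusion Module described above enriches geometry tokens with semantic context via cross-attention. The paper does not publish reference code here, so the following is only a minimal sketch of that idea under standard assumptions: queries are projected from geometry tokens, keys and values from semantic tokens, and the attended semantic context is added back residually. All names (`token_fusion`, `Wq`, `Wk`, `Wv`) are illustrative, not the authors' API.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def token_fusion(geo_tokens, sem_tokens, Wq, Wk, Wv):
    """Hypothetical sketch of token-wise fusion via cross-attention:
    geometry tokens query the semantic tokens for context, and the
    attended values are fused back with a residual connection."""
    Q = geo_tokens @ Wq                       # (N_geo, d) queries from geometry
    K = sem_tokens @ Wk                       # (N_sem, d) keys from semantics
    V = sem_tokens @ Wv                       # (N_sem, d) values from semantics
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
    return geo_tokens + attn @ V              # residual semantic enrichment

rng = np.random.default_rng(0)
d = 16
geo = rng.standard_normal((8, d))    # 8 geometry tokens
sem = rng.standard_normal((12, d))   # 12 semantic tokens
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
fused = token_fusion(geo, sem, Wq, Wk, Wv)
print(fused.shape)  # (8, 16): one enriched vector per geometry token
```

In a real model the projections would be learned layers with multi-head attention and layer normalization; the sketch keeps only the core query/key/value asymmetry that lets geometric tokens absorb semantic context.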
Problem

Research questions and friction points this paper is trying to address.

3D reconstruction
semantic understanding
unconstrained views
global semantic inconsistency
local structural inconsistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

feedforward 3D reconstruction
unconstrained views
semantic-geometry unification
annotation-free learning
feature fusion