🤖 AI Summary
This work tackles efficient reconstruction of complete 3D geometry from unstructured image collections with a feed-forward approach that, for the first time, extracts dense signed distance fields (SDFs) directly from the intermediate features of a pre-trained multi-view geometric Transformer. A unified 3D implicit representation is built from voxelized embeddings refined by interleaved cross- and self-attention, and a lightweight convolutional decoder then produces the full SDF. The method replaces conventional per-view prediction and post-hoc fusion pipelines, instead introducing an occupancy-aware SDF supervision strategy that accommodates real-world data such as non-watertight meshes. It achieves geometrically consistent and complete reconstructions under both sparse and dense input views, with inference in under three seconds, significantly outperforming existing fusion-based approaches.
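The extraction pipeline described above (canonical voxel embeddings that absorb multi-view features via interleaved cross- and self-attention, then a lightweight decoder) can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: all sizes, the single-head attention, and the per-voxel linear head standing in for the convolutional decoder are assumptions for clarity.

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention (single head, no learned projections).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

# Illustrative sizes (not from the paper).
D = 16        # feature dimension
R = 4         # voxel grid resolution -> R**3 canonical voxel tokens
N_IMG = 32    # tokens from the frozen multi-view geometry transformer

rng = np.random.default_rng(0)
voxel_tokens = rng.normal(size=(R**3, D))  # learned canonical embeddings
img_feats = rng.normal(size=(N_IMG, D))    # intermediate transformer features

# Interleaved cross-/self-attention: voxels progressively absorb
# multi-view geometry into a structured volumetric latent grid.
for _ in range(2):
    voxel_tokens = voxel_tokens + attention(voxel_tokens, img_feats, img_feats)
    voxel_tokens = voxel_tokens + attention(voxel_tokens, voxel_tokens, voxel_tokens)

# Stand-in for the convolutional decoder: one SDF value per voxel.
w_dec = rng.normal(size=(D, 1)) / np.sqrt(D)
sdf = (voxel_tokens @ w_dec).reshape(R, R, R)
print(sdf.shape)  # (4, 4, 4)
```

The key design point the sketch mirrors is that the queries live on a fixed canonical grid, so the output SDF is a single joint world representation rather than a set of per-view predictions fused afterwards.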
📝 Abstract
We propose a feed-forward method for dense Signed Distance Field (SDF) regression from unstructured image collections in less than three seconds, without camera calibration or post-hoc fusion. Our key insight is that the intermediate feature space of pre-trained multi-view feed-forward geometry transformers already encodes a powerful joint world representation; yet existing pipelines discard it, routing features through per-view prediction heads before assembling 3D geometry post-hoc, which loses valuable completeness information and accumulates inaccuracies.
We instead perform 3D extraction directly from geometry transformer features via learned volumetric extraction: voxelized canonical embeddings that progressively absorb multi-view geometry information through interleaved cross- and self-attention into a structured volumetric latent grid. A simple convolutional decoder then maps this grid to a dense SDF. We additionally propose a scalable, validity-aware supervision scheme that uses SDFs derived directly from depth maps or 3D assets, tackling practical issues such as non-watertight meshes. Our approach yields complete and well-defined distance values across sparse- and dense-view settings and demonstrates geometrically plausible completions. Code and further material can be found at https://lorafib.github.io/fus3d.
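To make the validity-aware supervision idea concrete, the sketch below derives per-sample SDF targets from a depth map along camera rays and masks out samples where the target is undefined. Everything here is a hedged illustration: the function name, the truncation band, and the use of ray-wise depth differences as an SDF proxy are assumptions (one common way to get SDF targets from depth), not necessarily the paper's exact scheme. The intuition it captures is that behind an observed surface of a non-watertight mesh the true signed distance is unknown, so those samples receive no loss.

```python
import numpy as np

def validity_aware_sdf_loss(pred_sdf, sample_depth, observed_depth, trunc=0.1):
    """Masked L2 loss on truncated depth-derived SDF targets.

    Positive target = sample lies in free space in front of the surface;
    negative = behind it. Beyond the truncation band behind the surface,
    the SDF is undefined for non-watertight geometry, so we mask it out.
    """
    target = observed_depth - sample_depth      # signed distance along the ray
    valid = target > -trunc                     # keep free space + thin shell
    target = np.clip(target, -trunc, trunc)     # truncate to a narrow band
    err = (pred_sdf - target) ** 2
    return err[valid].mean(), valid

# Toy example: five samples along one ray, surface observed at depth 1.0.
pred = np.zeros(5)                               # network predicts SDF = 0
sample_depth = np.array([0.5, 0.9, 1.0, 1.05, 1.5])
observed_depth = np.ones(5)
loss, valid = validity_aware_sdf_loss(pred, sample_depth, observed_depth)
print(valid)  # last sample is far behind the surface -> masked out
```

With these numbers, the sample at depth 1.5 sits 0.5 behind the surface, outside the 0.1 truncation band, so it contributes nothing to the loss; the remaining four are supervised against their clipped targets.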