Déjà View: Looping Transformers for Multi-View 3D Reconstruction

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high parameter count and structural redundancy of conventional feed-forward Transformer models for multi-view 3D reconstruction. The authors propose Déjà View, a recurrent architecture that applies a single Transformer block K times to per-view features, explicitly modeling the progressive refinement inherent in the reconstruction process. Replacing implicit deep stacking with explicit recurrence cuts the number of unique parameters to a fraction of the baselines' while keeping compute comparable or lower, and because the model is trained once, K can be adjusted at inference time as a compute knob. Evaluated on five 3D reconstruction benchmarks spanning indoor, outdoor, object-centric, and driving scenes, Déjà View matches or outperforms substantially larger feed-forward baselines. The looped block also outperforms an otherwise identical variant with independent per-step parameters under matched training data and compute, which suggests that explicit iteration is a stronger inductive bias and not only a cheaper substitute for capacity.

📝 Abstract

Recent feed-forward 3D reconstruction transformers have scaled to over a billion parameters, following the broader trend of increasing model capacity in computer vision. Yet emerging evidence suggests that contiguous transformer layers often behave like repeated applications of similar operations, and multi-view reconstruction transformers refine their predictions progressively across decoder depth. We posit that model depth partially buys iteration, paid for inefficiently in unique parameters, and instead make that iteration explicit in architecture. Our model, DéjàView, applies a single looped transformer block recurrently to per-view features for K refinement steps. Trained once, it exposes K as an inference-time compute knob, matching or outperforming substantially larger feed-forward baselines across five reconstruction benchmarks spanning indoor, outdoor, object-centric, and driving scenes, while using a fraction of their parameters and comparable or lower compute. Importantly, the same looped block formulation outperforms an otherwise identical variant with independent per-step parameters under matched training data and compute, suggesting that explicit iteration is not merely a compute-efficient substitute for capacity but a stronger inductive bias for multi-view 3D reconstruction.
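
The difference between a looped block and a stack of independent blocks can be illustrated with a toy sketch. This is not the authors' implementation: `ToyBlock`, `LoopedModel`, and `UnsharedModel` are hypothetical names, and the block is a simple residual update on a feature vector standing in for a real transformer block with attention and an MLP. It only shows how weight sharing keeps the parameter count independent of K and lets K be chosen at inference time.

```python
import math
import random


class ToyBlock:
    """Hypothetical stand-in for a transformer block (assumption, not the paper's block)."""

    def __init__(self, dim, seed):
        rng = random.Random(seed)
        # One dim x dim weight matrix; a real block would hold attention and MLP weights.
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]

    def num_params(self):
        return sum(len(row) for row in self.w)

    def __call__(self, x):
        # Residual refinement step: x <- x + tanh(W x)
        return [
            xi + math.tanh(sum(wij * xj for wij, xj in zip(row, x)))
            for xi, row in zip(x, self.w)
        ]


class LoopedModel:
    """Applies ONE shared block k times; k is chosen per call."""

    def __init__(self, dim, seed=0):
        self.block = ToyBlock(dim, seed)

    def num_params(self):
        return self.block.num_params()

    def forward(self, x, k):
        for _ in range(k):
            x = self.block(x)
        return x


class UnsharedModel:
    """Ablation-style variant: k blocks with independent parameters, depth fixed at build time."""

    def __init__(self, dim, k, seed=0):
        self.blocks = [ToyBlock(dim, seed + i) for i in range(k)]

    def num_params(self):
        return sum(b.num_params() for b in self.blocks)

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x


if __name__ == "__main__":
    dim, k = 8, 6
    looped = LoopedModel(dim)
    unshared = UnsharedModel(dim, k)
    features = [0.1] * dim  # toy per-view feature vector
    print("looped params:", looped.num_params())
    print("unshared params:", unshared.num_params())
    # The same trained weights can be run for fewer or more refinement steps.
    for steps in (2, 4, 8):
        print(steps, looped.forward(features, steps)[:2])
```

In this sketch the looped model's parameter count stays at `dim * dim` for any number of steps, while the unshared variant grows linearly with K and cannot change its depth after construction.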

Problem

Research questions and friction points this paper is trying to address.

multi-view 3D reconstruction

transformer

model efficiency

parameter redundancy

iterative refinement

Innovation

Methods, ideas, or system contributions that make the work stand out.

looped transformer

iterative refinement

multi-view 3D reconstruction