Understanding Multi-View Transformers

📅 2025-10-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multi-view transformers such as DUSt3R achieve strong performance in 3D vision but remain poorly understood, which hinders interpretability and safe deployment. To address this, the paper presents a layer-wise analysis framework for multi-view transformers that probes features from the residual connections and applies geometry-aware visualization to trace how internal 3D representations develop. The analysis of a DUSt3R variant indicates that hidden states progressively encode 3D structure: earlier layers estimate local correspondences, which deeper layers refine with reconstructed geometry, supporting feed-forward pose estimation without explicit global pose modeling. The study sheds light on how and why such models work, informing architecture design and reliability validation. Code is publicly available.

📝 Abstract
Multi-view transformers such as DUSt3R are revolutionizing 3D vision by solving 3D tasks in a feed-forward manner. However, in contrast to previous optimization-based pipelines, the inner mechanisms of multi-view transformers are unclear. Their black-box nature makes improvements beyond data scaling challenging and complicates their use in safety- and reliability-critical applications. Here, we present an approach for probing and visualizing 3D representations from the residual connections of the multi-view transformers' layers. In this manner, we investigate a variant of the DUSt3R model, shedding light on the development of its latent state across blocks and the role of the individual layers, and suggesting how it differs from methods with stronger inductive biases such as explicit global pose. Finally, we show that the investigated variant of DUSt3R estimates correspondences that are refined with reconstructed geometry. The code used for the analysis is available at https://github.com/JulienGaubil/und3rstand.
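As a rough illustration of the geometry-aware visualization the abstract describes, one can decode a per-patch point map from a layer's features and render its depth channel on a shared color scale so layers are comparable. This is a minimal sketch with hypothetical shapes and random data, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 14, 14                          # patch grid of a hypothetical ViT backbone
pointmap = rng.normal(size=(H, W, 3))  # per-patch 3D points decoded from one layer
pointmap[..., 2] += 5.0                # push points in front of the camera

# Depth is the z-coordinate in the camera frame; normalizing to [0, 1]
# makes the same color scale comparable across layers.
depth = pointmap[..., 2]
depth_vis = (depth - depth.min()) / (depth.max() - depth.min())
print(depth_vis.shape, float(depth_vis.min()), float(depth_vis.max()))
```

The normalized map can then be passed to any image viewer or colormap; repeating this per layer is what reveals how the latent geometry sharpens across blocks.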
Problem

Research questions and friction points this paper is trying to address.

Analyzing inner mechanisms of multi-view transformers like DUSt3R
Visualizing 3D representations from transformer residual connections
Investigating latent state development and layer roles in transformers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Probes residual connections in transformer layers
Visualizes latent 3D representations across blocks
Analyzes correspondence refinement through reconstructed geometry
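The residual-connection probing idea above can be sketched with forward hooks on a toy transformer. Everything here is hypothetical (a small generic encoder, random inputs, synthetic 3D targets); the paper's actual probes target a DUSt3R variant and are in the linked repository:

```python
# Sketch of layer-wise residual-stream probing: hook each transformer block,
# collect its output, and fit a per-layer linear probe to 3D targets.
import torch
import torch.nn as nn
import numpy as np

torch.manual_seed(0)
DIM, LAYERS, TOKENS = 32, 4, 64

# Hypothetical stand-in for a multi-view transformer backbone.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True),
    num_layers=LAYERS,
)

# Capture the residual stream after each block via forward hooks.
features = {}
def make_hook(idx):
    def hook(module, inputs, output):
        features[idx] = output.detach()
    return hook

for i, layer in enumerate(encoder.layers):
    layer.register_forward_hook(make_hook(i))

tokens = torch.randn(1, TOKENS, DIM)   # stand-in for patch embeddings
target = torch.randn(TOKENS, 3)        # stand-in for per-patch 3D points
encoder(tokens)

# Fit a least-squares linear probe per layer; its fit quality indicates how
# linearly decodable the 3D quantity is at that depth of the network.
for i in range(LAYERS):
    X = features[i][0].numpy()         # (TOKENS, DIM) layer features
    Y = target.numpy()                 # (TOKENS, 3) probe targets
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ W
    r2 = 1.0 - resid.var() / Y.var()
    print(f"layer {i}: probe R^2 = {r2:.3f}")
```

Plotting the per-layer probe scores is one way to see where correspondence-like information appears and where geometric refinement takes over.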