H3R: Hybrid Multi-view Correspondence for Generalizable 3D Reconstruction

📅 2025-08-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
In generalizable multi-view 3D reconstruction, explicit methods achieve high geometric accuracy but lack robustness in ambiguous regions, whereas implicit methods are robust yet converge slowly, a fundamental trade-off. To bridge this gap, we propose H3R, a hybrid explicit-implicit framework. Our key contributions are: (1) coupling volumetric latent fusion with attention-based feature aggregation; (2) a camera-aware Transformer with Plücker coordinate encoding that enables geometry-adaptive cross-view correspondence; and (3) an efficient latent volume that enforces epipolar constraints, combined with the spatially-aligned SD-VAE, which outperforms semantic-aligned features such as DINOv2 for reconstruction. Evaluated on RealEstate10K, ACID, and DTU, our method improves PSNR by 0.59 dB, 1.06 dB, and 0.22 dB, respectively, while converging 2× faster than existing methods. It supports variable numbers of high-resolution input views and demonstrates strong cross-dataset generalization.
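
The summary's second contribution, Plücker coordinate encoding, conditions the camera-aware Transformer on per-pixel ray geometry. Below is a minimal PyTorch sketch of how such an embedding is commonly computed; the function name and interface are illustrative assumptions, not the authors' released code. Each pixel's ray is parameterized by its unit direction d and moment o × d, yielding a 6-channel, pose-dependent encoding.

```python
import torch

def plucker_ray_embedding(K: torch.Tensor, c2w: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """Per-pixel Plücker ray embeddings, shape (H, W, 6).

    Hypothetical helper: K is a 3x3 intrinsic matrix, c2w a 4x4
    camera-to-world matrix. Channels are (direction d, moment o x d).
    """
    # Pixel-center grid in homogeneous image coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=torch.float32) + 0.5,
        torch.arange(W, dtype=torch.float32) + 0.5,
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)   # (H, W, 3)

    # Back-project pixels to camera-space ray directions, then rotate
    # into world space and normalize.
    dirs = pix @ torch.linalg.inv(K).T                          # (H, W, 3)
    R, t = c2w[:3, :3], c2w[:3, 3]
    dirs = dirs @ R.T
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)

    # Plücker coordinates: direction d and moment m = o x d, where the
    # origin o is the camera center. The moment makes the encoding
    # depend on the full ray, not just its direction.
    origins = t.expand_as(dirs)
    moments = torch.cross(origins, dirs, dim=-1)
    return torch.cat([dirs, moments], dim=-1)                   # (H, W, 6)
```

Because the moment term changes as the camera moves, concatenating this map with image features gives the Transformer a view-dependent signal for cross-view correspondence without hand-crafted epipolar indexing.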

📝 Abstract
Despite recent advances in feed-forward 3D Gaussian Splatting, generalizable 3D reconstruction remains challenging, particularly in multi-view correspondence modeling. Existing approaches face a fundamental trade-off: explicit methods achieve geometric precision but struggle with ambiguous regions, while implicit methods provide robustness but suffer from slow convergence. We present H3R, a hybrid framework that addresses this limitation by integrating volumetric latent fusion with attention-based feature aggregation. Our framework consists of two complementary components: an efficient latent volume that enforces geometric consistency through epipolar constraints, and a camera-aware Transformer that leverages Plücker coordinates for adaptive correspondence refinement. By integrating both paradigms, our approach enhances generalization while converging 2× faster than existing methods. Furthermore, we show that spatial-aligned foundation models (e.g., SD-VAE) substantially outperform semantic-aligned models (e.g., DINOv2), resolving the mismatch between semantic representations and spatial reconstruction requirements. Our method supports variable-number and high-resolution input views while demonstrating robust cross-dataset generalization. Extensive experiments show that our method achieves state-of-the-art performance across multiple benchmarks, with significant PSNR improvements of 0.59 dB, 1.06 dB, and 0.22 dB on the RealEstate10K, ACID, and DTU datasets, respectively. Code is available at https://github.com/JiaHeng-DLUT/H3R.
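
The abstract's explicit branch, a latent volume constrained by epipolar geometry, is typically realized as a plane-sweep volume: source-view features are warped onto depth hypotheses of a reference view, so features agree across views only near the correct depth. The sketch below illustrates this idea under stated assumptions (precomputed per-depth homographies and simple variance fusion); it is a generic plane-sweep construction, not H3R's actual implementation.

```python
import torch
import torch.nn.functional as F

def plane_sweep_volume(ref_feat, src_feats, homographies, num_depths):
    """Epipolar-consistent latent volume via plane sweep, shape (D, C, H, W).

    Hypothetical helper: ref_feat is (C, H, W); src_feats is a list of
    (C, H, W) source-view features; homographies is (V, D, 3, 3),
    mapping reference pixels onto each source view at each depth.
    """
    C, H, W = ref_feat.shape
    device = ref_feat.device
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32),
        indexing="ij",
    )
    grid = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)     # (H, W, 3)
    scale = torch.tensor([W - 1, H - 1], device=device, dtype=torch.float32)

    warped = [ref_feat.expand(num_depths, -1, -1, -1)]            # reference copy per depth
    for v, feat in enumerate(src_feats):
        per_depth = []
        for d in range(num_depths):
            # Project reference pixels into source view v at depth hypothesis d.
            p = grid @ homographies[v, d].T
            uv = p[..., :2] / p[..., 2:3].clamp(min=1e-6)
            uv = 2 * uv / scale - 1                               # normalize to [-1, 1]
            per_depth.append(
                F.grid_sample(feat[None], uv[None], align_corners=True)[0]
            )
        warped.append(torch.stack(per_depth))                     # (D, C, H, W)

    # Variance across views: low where features agree, i.e., near the true depth.
    return torch.stack(warped).var(dim=0)
```

In a hybrid design like the one described here, such a volume supplies geometrically grounded matching evidence, while the attention branch refines correspondences in regions where the epipolar signal is ambiguous.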
Problem

Research questions and friction points this paper is trying to address.

Improving multi-view correspondence for generalizable 3D reconstruction
Balancing geometric precision and robustness in 3D modeling
Resolving semantic-spatial mismatch in reconstruction with foundation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid volumetric latent fusion with attention
Camera-aware Transformer for adaptive refinement
Spatial-aligned foundation models enhance generalization
Heng Jia
Zhejiang University
Linchao Zhu
ReLER Lab, CCAI, Zhejiang University; The State Key Lab of Brain-Machine Intelligence, Zhejiang University
Na Zhao
Singapore University of Technology and Design