Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation

📅 2026-04-23
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing multi-frame models achieve cross-frame consistency, but their single-frame accuracy often lags behind that of strong per-frame methods. Through systematic ablation studies, this work shows that scaling up data diversity and quality yields further gains for 3D visual geometry estimation, and that commonly used confidence-aware and gradient-based losses may inadvertently hinder performance. Building on these findings, the authors propose CARVE, a resolution-enhanced feed-forward model that combines an efficient high-resolution architecture, joint per-sequence and per-frame supervision, and a consistency loss that enforces alignment between depth maps, camera parameters, and point maps. CARVE achieves strong and robust performance across diverse benchmarks for point cloud reconstruction, video depth estimation, and camera pose and intrinsics estimation.
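To make the idea of a consistency loss between depth, camera, and point predictions concrete, the following is a minimal sketch, not the formulation used by CARVE, which this page does not specify. It assumes a pinhole camera, a camera-to-world pose convention, and a plain mean L1 distance; the function names `unproject_depth` and `consistency_loss` are hypothetical.

```python
import numpy as np

def unproject_depth(depth, K, cam_to_world):
    """Lift a depth map (H, W) to world-space points (H, W, 3).

    Assumes a pinhole intrinsics matrix K (3x3) and a 4x4 camera-to-world pose.
    """
    h, w = depth.shape
    # Pixel grid: u indexes columns, v indexes rows; both have shape (H, W).
    u, v = np.meshgrid(np.arange(w, dtype=np.float64),
                       np.arange(h, dtype=np.float64))
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    cam_points = np.stack([x, y, depth], axis=-1)
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    # Rotate and translate every camera-space point into the world frame.
    return cam_points @ R.T + t

def consistency_loss(depth, K, cam_to_world, point_map):
    """Mean L1 gap between points implied by depth + camera and the point map."""
    implied = unproject_depth(depth, K, cam_to_world)
    return float(np.mean(np.abs(implied - point_map)))

# Toy example with made-up values: a flat depth plane seen by a shifted camera.
K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 1.5], [0.0, 0.0, 1.0]])
pose = np.eye(4)
pose[:3, 3] = [0.5, 0.0, -1.0]
depth = np.full((3, 4), 2.0)
points = unproject_depth(depth, K, pose)

print(consistency_loss(depth, K, pose, points))        # consistent outputs
print(consistency_loss(depth, K, pose, points + 0.1))  # shifted point map
```

The loss is zero only when the three outputs describe the same geometry, so a term of this kind penalizes a model whose depth, camera, and point-map heads drift apart.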

📝 Abstract
Feed-forward visual geometry estimation has recently made rapid progress. However, an important gap remains: multi-frame models usually produce better cross-frame consistency, yet they often underperform strong per-frame methods on single-frame accuracy. This observation motivates our systematic investigation into the critical factors driving model performance through rigorous ablation studies, which reveals several key insights: 1) Scaling up data diversity and quality unlocks further performance gains even in state-of-the-art visual geometry estimation methods; 2) Commonly adopted confidence-aware loss and gradient-based loss mechanisms may unintentionally hinder performance; 3) Joint supervision through both per-sequence and per-frame alignment improves results, while local region alignment surprisingly degrades performance. Furthermore, we introduce two enhancements to integrate the advantages of optimization-based methods and high-resolution inputs: a consistency loss function that enforces alignment between depth maps, camera parameters, and point maps, and an efficient architectural design that leverages high-resolution information. We integrate these designs into CARVE, a resolution-enhanced model for feed-forward visual geometry estimation. Experiments on point cloud reconstruction, video depth estimation, and camera pose/intrinsic estimation show that CARVE achieves strong and robust performance across diverse benchmarks.
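The difference between per-sequence and per-frame alignment in insight 3 can be illustrated with a small sketch. This is an illustration under stated assumptions rather than the paper's loss: it uses scale-only least-squares alignment, an L1 residual, and equal weights, and the names `ls_scale` and `joint_alignment_loss` are hypothetical.

```python
import numpy as np

def ls_scale(pred, gt):
    """Closed-form scale s minimizing ||s * pred - gt||^2."""
    return float(np.sum(pred * gt) / max(float(np.sum(pred * pred)), 1e-12))

def joint_alignment_loss(pred, gt, w_seq=1.0, w_frame=1.0):
    """Combine sequence-level and frame-level aligned L1 errors.

    pred, gt: depth arrays of shape (T, H, W).
    Returns (total, sequence_term, frame_term).
    """
    # Per-sequence: one shared scale for all frames, so scale drift
    # between frames is penalized.
    s_seq = ls_scale(pred, gt)
    seq_term = float(np.mean(np.abs(s_seq * pred - gt)))
    # Per-frame: each frame gets its own scale, so only the error in
    # each frame's relative geometry is penalized.
    frame_term = float(np.mean([
        np.mean(np.abs(ls_scale(p, g) * p - g)) for p, g in zip(pred, gt)
    ]))
    return w_seq * seq_term + w_frame * frame_term, seq_term, frame_term

# Toy example with made-up values: each frame is perfect up to its own scale.
gt = np.arange(1, 13, dtype=np.float64).reshape(2, 2, 3)
pred = gt.copy()
pred[0] *= 0.5
pred[1] *= 2.0

total, seq_term, frame_term = joint_alignment_loss(pred, gt)
print(seq_term, frame_term)
```

In this toy case the per-frame term vanishes because each frame is correct up to its own scale, while the per-sequence term stays positive because the two frames disagree on scale. Supervising both terms therefore targets single-frame accuracy and cross-frame consistency at the same time.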
Problem

Research questions and friction points this paper is trying to address.

visual geometry estimation
multi-frame consistency
single-frame accuracy
3D reconstruction
depth estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

consistency loss
high-resolution architecture
visual geometry estimation
ablation study
multi-frame consistency