🤖 AI Summary
This work addresses the challenge of efficiently recovering high-resolution, view-consistent geometry and camera poses from uncalibrated multi-view images or video. We propose a dual-stream Transformer architecture that decouples global consistency modeling from fine-detail preservation: a low-resolution stream alternates between frame-wise and global attention to efficiently estimate camera poses and construct a globally consistent representation, while a high-resolution stream processes raw frames individually to retain fine geometric structures. The two streams are fused via lightweight cross-attention adapters. This design enables independent scaling of resolution and sequence length, supporting inputs up to 2K resolution with low inference cost while effectively integrating global context and local detail. Our method achieves state-of-the-art results on video-based geometry estimation and multi-view reconstruction, producing sharp depth maps and point clouds, strong cross-view consistency, and highly accurate camera poses.
📝 Abstract
Estimating accurate, view-consistent geometry and camera poses from uncalibrated multi-view/video inputs remains challenging, especially at high spatial resolutions and over long sequences. We present DAGE, a dual-stream transformer whose main novelty is to disentangle global coherence from fine detail. A low-resolution stream operates on aggressively downsampled frames with alternating frame-wise and global attention to build a view-consistent representation and estimate cameras efficiently, while a high-resolution stream processes the original images per-frame to preserve sharp boundaries and small structures. A lightweight adapter fuses these streams via cross-attention, injecting global context without disturbing the pretrained single-frame pathway. This design scales resolution and clip length independently, supports inputs up to 2K, and maintains practical inference cost. DAGE delivers sharp depth maps and point maps, strong cross-view consistency, and accurate poses, establishing new state-of-the-art results for video geometry estimation and multi-view reconstruction.
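The dual-stream design described above can be sketched in a few lines of NumPy. This is an illustrative mock-up under assumptions, not the paper's implementation: the token counts, number of blocks, and the `attention`, `low_res_stream`, and `fuse` helpers are hypothetical, and the real model uses trained Transformer weights with projections, normalization, and multiple heads. It only shows the data flow: frame-wise vs. global attention in the low-resolution stream, and a cross-attention adapter that lets high-resolution per-frame tokens query the globally consistent low-resolution context.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over the last two axes.
    d = q.shape[-1]
    w = softmax(q @ k.swapaxes(-1, -2) / np.sqrt(d), axis=-1)
    return w @ v

def low_res_stream(tokens, n_blocks=2):
    # tokens: (frames, tokens_per_frame, dim). Alternate frame-wise
    # self-attention (within each frame) with global self-attention
    # (across all frames), mirroring the low-resolution stream.
    F, T, D = tokens.shape
    x = tokens
    for _ in range(n_blocks):
        x = attention(x, x, x)                            # frame-wise
        flat = x.reshape(1, F * T, D)
        x = attention(flat, flat, flat).reshape(F, T, D)  # global
    return x

def fuse(high_res, global_ctx):
    # Lightweight cross-attention adapter: high-res tokens (queries)
    # attend to the low-res global context (keys/values), added as a
    # residual so the single-frame pathway is left intact.
    return high_res + attention(high_res, global_ctx, global_ctx)

rng = np.random.default_rng(0)
F, D = 4, 8
low = rng.standard_normal((F, 16, D))    # downsampled frames -> few tokens
high = rng.standard_normal((F, 256, D))  # full-res frames -> many tokens

ctx = low_res_stream(low)    # view-consistent global representation
fused = fuse(high, ctx)      # inject global context into each frame
print(fused.shape)           # (4, 256, 8): same shape as the high-res tokens
```

Note how the two streams scale independently: global attention cost grows with the (small) number of low-resolution tokens, while the high-resolution tokens are only touched per-frame and in the adapter's cross-attention.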