🤖 AI Summary
This work addresses the challenge of efficiently recovering high-resolution, view-consistent geometry and camera poses from uncalibrated multi-view images or video. We propose a dual-stream Transformer architecture that decouples global consistency modeling from fine-detail preservation: a low-resolution stream alternates between frame-wise and global attention to efficiently estimate camera poses and construct a globally consistent representation, while a high-resolution stream processes raw frames individually to retain fine geometric structures. The two streams are fused via lightweight cross-attention adapters. This design enables independent scaling of resolution and sequence length, supporting inputs up to 2K resolution with low inference cost while effectively integrating global context and local detail. Our method achieves state-of-the-art results on video-based geometry estimation and multi-view reconstruction, producing sharp depth maps and point clouds, strong cross-view consistency, and highly accurate camera poses.
📝 Abstract
Estimating accurate, view-consistent geometry and camera poses from uncalibrated multi-view/video inputs remains challenging, especially at high spatial resolutions and over long sequences. We present DAGE, a dual-stream transformer whose main novelty is to disentangle global coherence from fine detail. A low-resolution stream operates on aggressively downsampled frames with alternating frame-wise and global attention to build a view-consistent representation and estimate cameras efficiently, while a high-resolution stream processes the original images per-frame to preserve sharp boundaries and small structures. A lightweight adapter fuses these streams via cross-attention, injecting global context without disturbing the pretrained single-frame pathway. This design scales resolution and clip length independently, supports inputs up to 2K, and maintains practical inference cost. DAGE delivers sharp depth maps and point maps, strong cross-view consistency, and accurate poses, establishing new state-of-the-art results for video geometry estimation and multi-view reconstruction.
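The dual-stream design described above can be sketched in a few lines of NumPy. This is an illustrative mock-up under assumptions, not the paper's implementation: the token counts, number of blocks, and the `attention`, `low_res_stream`, and `fuse` helpers are hypothetical, and the real model uses trained Transformer weights with projections, normalization, and multiple heads. It only shows the data flow: frame-wise vs. global attention in the low-resolution stream, and a cross-attention adapter that lets high-resolution per-frame tokens query the globally consistent low-resolution context.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over the last two axes.
    d = q.shape[-1]
    w = softmax(q @ k.swapaxes(-1, -2) / np.sqrt(d), axis=-1)
    return w @ v

def low_res_stream(tokens, n_blocks=2):
    # tokens: (frames, tokens_per_frame, dim). Alternate frame-wise
    # self-attention (within each frame) with global self-attention
    # (across all frames), mirroring the low-resolution stream.
    F, T, D = tokens.shape
    x = tokens
    for _ in range(n_blocks):
        x = attention(x, x, x)                            # frame-wise
        flat = x.reshape(1, F * T, D)
        x = attention(flat, flat, flat).reshape(F, T, D)  # global
    return x

def fuse(high_res, global_ctx):
    # Lightweight cross-attention adapter: high-res tokens (queries)
    # attend to the low-res global context (keys/values), added as a
    # residual so the single-frame pathway is left intact.
    return high_res + attention(high_res, global_ctx, global_ctx)

rng = np.random.default_rng(0)
F, D = 4, 8
low = rng.standard_normal((F, 16, D))    # downsampled frames -> few tokens
high = rng.standard_normal((F, 256, D))  # full-res frames -> many tokens

ctx = low_res_stream(low)    # view-consistent global representation
fused = fuse(high, ctx)      # inject global context into each frame
print(fused.shape)           # (4, 256, 8): same shape as the high-res tokens
```

Note how the two streams scale independently: global attention cost grows with the (small) number of low-resolution tokens, while the high-resolution tokens are only touched per-frame and in the adapter's cross-attention.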