Speedy MASt3R

📅 2025-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
MASt3R achieves high-accuracy image matching but incurs a prohibitive per-pair inference latency of 198.16 ms on an A40 GPU, hindering real-time 3D understanding. To address this, we propose a four-pronged acceleration framework: (1) FlashMatch, a tiled cross-image attention mechanism leveraging FlashAttention v2; (2) GraphFusion, TensorRT-driven computational graph fusion with automatic kernel tuning; (3) FastNN-Lite, linear-memory access with vectorized block-wise correlation scoring; and (4) HybridCast, an FP16/FP32 mixed-precision inference engine. Our approach preserves the original accuracy on benchmarks including Aachen Day-Night and ScanNet1500 while reducing per-pair latency to 91 ms, a 54% reduction in inference time. This work is the first to jointly integrate efficient attention, graph-level optimization, and lightweight approximate nearest-neighbor retrieval, significantly alleviating the latency bottlenecks in ViT encoding/decoding and FastNN computation.
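To make the FlashMatch ingredient concrete, the online-softmax tiling at the heart of FlashAttention v2 can be sketched in plain NumPy. This is a simplified single-head illustration of the tiling idea, not the authors' kernels; the function name and the `tile` parameter are chosen here for exposition.

```python
import numpy as np

def tiled_attention(q, k, v, tile=64):
    """Streaming (online-softmax) attention over key/value tiles.

    Processes K/V one tile at a time so the full Nq x Nk score matrix is
    never materialized -- the memory-saving idea behind FlashAttention.
    q: (Nq, d), k/v: (Nk, d); single head, scaled by 1/sqrt(d).
    """
    scale = 1.0 / np.sqrt(q.shape[-1])
    n_q = q.shape[0]
    out = np.zeros_like(q, dtype=np.float64)
    running_max = np.full(n_q, -np.inf)   # per-query running softmax max
    running_sum = np.zeros(n_q)           # per-query running softmax denominator
    for start in range(0, k.shape[0], tile):
        k_t, v_t = k[start:start + tile], v[start:start + tile]
        s = (q @ k_t.T) * scale                     # (Nq, tile) partial scores
        new_max = np.maximum(running_max, s.max(axis=1))
        correction = np.exp(running_max - new_max)  # rescale old accumulators
        p = np.exp(s - new_max[:, None])            # tile softmax numerator
        running_sum = running_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ v_t
        running_max = new_max
    return out / running_sum[:, None]
```

Because the running max and sum are corrected as each tile arrives, the result matches ordinary softmax attention up to floating-point error, while peak memory scales with the tile size instead of the full score matrix.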

📝 Abstract
Image matching is a key component of modern 3D vision algorithms, essential for accurate scene reconstruction and localization. MASt3R redefines image matching as a 3D task by leveraging DUSt3R and introducing a fast reciprocal matching scheme that accelerates matching by orders of magnitude while preserving theoretical guarantees. This approach has gained strong traction, with DUSt3R and MASt3R collectively cited over 250 times in a short span, underscoring their impact. However, despite its accuracy, MASt3R's inference speed remains a bottleneck: on an A40 GPU, latency per image pair is 198.16 ms, mainly due to computational overhead from the ViT encoder-decoder and Fast Reciprocal Nearest Neighbor (FastNN) matching. To address this, we introduce Speedy MASt3R, a post-training optimization framework that improves inference efficiency while maintaining accuracy. It integrates multiple optimization techniques: FlashMatch, which leverages FlashAttention v2 with tiling strategies for improved efficiency; GraphFusion, computation-graph optimization via layer and tensor fusion with kernel auto-tuning in TensorRT; and FastNN-Lite, a streamlined FastNN pipeline that reduces memory access time from quadratic to linear while accelerating block-wise correlation scoring through vectorized computation. Additionally, it employs mixed-precision inference with FP16/FP32 hybrid computation (HybridCast), achieving speedup while preserving numerical precision. Evaluated on Aachen Day-Night, InLoc, 7-Scenes, ScanNet1500, and MegaDepth1500, Speedy MASt3R achieves a 54% reduction in inference time (198 ms to 91 ms per image pair) without sacrificing accuracy. This advancement enables real-time 3D understanding, benefiting applications such as mixed-reality navigation and large-scale 3D scene reconstruction.
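The HybridCast idea, FP16 storage and compute with FP32 accumulation, can be illustrated with a toy linear layer. This is a hedged NumPy sketch, not the paper's inference engine: `hybrid_linear` is an illustrative name, and since NumPy has no FP16 tensor cores, the cast to `float16` here models the precision loss of half-precision operands while the reduction runs in FP32.

```python
import numpy as np

def hybrid_linear(x, w, b):
    """Mixed-precision linear layer sketch (illustrative only).

    Operands are quantized to FP16, which halves memory traffic on real
    hardware; the matmul accumulation and bias add run in FP32 to limit
    rounding error -- the FP16-compute / FP32-accumulate pattern typical
    of mixed-precision inference.
    """
    x16 = x.astype(np.float16)              # half-precision activations
    w16 = w.astype(np.float16)              # half-precision weights
    # Upcast the FP16 operands so the reduction accumulates in FP32.
    acc = x16.astype(np.float32) @ w16.astype(np.float32)
    return acc + b.astype(np.float32)       # FP32 bias add
```

Keeping the accumulator and numerically sensitive ops (softmax, normalization) in FP32 is what lets such schemes recover the speed of FP16 without drifting from full-precision outputs.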
Problem

Research questions and friction points this paper is trying to address.

Improves inference speed of MASt3R for real-time 3D vision tasks.
Reduces computational overhead in image matching for faster scene reconstruction.
Maintains accuracy while optimizing latency in 3D vision algorithms.
Innovation

Methods, ideas, or system contributions that make the work stand out.

FlashMatch accelerates cross-image attention with FlashAttention v2 tiling
GraphFusion fuses the computation graph with TensorRT kernel auto-tuning
FastNN-Lite cuts memory access from quadratic to linear and vectorizes block-wise correlation scoring
HybridCast mixes FP16/FP32 computation for speed without losing precision
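FastNN-Lite's combination of block-wise scoring and vectorized correlation can be sketched for reciprocal nearest-neighbor matching as follows. This is an illustrative NumPy pattern, not the paper's pipeline: scores are computed one block of descriptors at a time, so peak memory is linear in the descriptor count rather than quadratic, and each block's correlation is a single vectorized matmul.

```python
import numpy as np

def reciprocal_matches(desc_a, desc_b, block=512):
    """Block-wise reciprocal nearest-neighbor matching sketch.

    desc_a: (Na, d), desc_b: (Nb, d) descriptor arrays.
    Returns an (M, 2) array of mutually-nearest index pairs (i_a, i_b).
    """
    # Nearest neighbor in B for every descriptor in A, one block at a time.
    nn_ab = np.empty(desc_a.shape[0], dtype=np.int64)
    for start in range(0, desc_a.shape[0], block):
        scores = desc_a[start:start + block] @ desc_b.T   # (block, Nb)
        nn_ab[start:start + block] = scores.argmax(axis=1)
    # Nearest neighbor in A for every descriptor in B.
    nn_ba = np.empty(desc_b.shape[0], dtype=np.int64)
    for start in range(0, desc_b.shape[0], block):
        scores = desc_b[start:start + block] @ desc_a.T   # (block, Na)
        nn_ba[start:start + block] = scores.argmax(axis=1)
    # Keep only mutual (reciprocal) nearest neighbors.
    idx_a = np.arange(desc_a.shape[0])
    keep = nn_ba[nn_ab] == idx_a
    return np.stack([idx_a[keep], nn_ab[keep]], axis=1)
```

The block size trades peak memory against per-call matmul efficiency; the reciprocity filter at the end is what gives mutual-NN matching its robustness to one-sided ambiguous matches.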