A Study of Finetuning Video Transformers for Multi-view Geometry Tasks

📅 2025-12-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the generalization capability of generic video foundation models, specifically Video Transformers, on multi-view geometry tasks including optical flow estimation, monocular depth estimation, and stereo matching, without relying on task-specific architectures or dedicated pretraining. Methodologically, it shows that the spatiotemporal self-attention learned by pretrained Video Transformers intrinsically encodes the capacity for geometric reasoning. Attaching only a lightweight linear decoder and adding an iterative refinement strategy suffices for efficient fine-tuning. The approach achieves state-of-the-art optical flow performance, with EPEs of 0.69, 1.78, and 3.15 on Sintel clean, Sintel final, and KITTI 2015, respectively, and online test EPEs of 0.79 and 1.88 with an F1 of 3.79 on KITTI. Strong results on depth estimation and stereo matching further validate the method's universality and robust cross-task generalization.

📝 Abstract
This paper presents an investigation of vision transformer learning for multi-view geometry tasks, such as optical flow estimation, by fine-tuning video foundation models. Unlike previous methods that involve custom architectural designs and task-specific pretraining, our research finds that general-purpose models pretrained on videos can be readily transferred to multi-view problems with minimal adaptation. The core insight is that general-purpose attention between patches learns temporal and spatial information for geometric reasoning. We demonstrate that appending a linear decoder to the Transformer backbone produces satisfactory results, and iterative refinement can further elevate performance to state-of-the-art levels. This conceptually simple approach achieves top cross-dataset generalization results for optical flow estimation with end-point error (EPE) of 0.69, 1.78, and 3.15 on the Sintel clean, Sintel final, and KITTI datasets, respectively. Our method additionally establishes a new record on the online test benchmark with EPE values of 0.79, 1.88, and F1 value of 3.79. Applications to 3D depth estimation and stereo matching also show strong performance, illustrating the versatility of video-pretrained models in addressing geometric vision tasks.
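The recipe described in the abstract (pretrained video-transformer backbone, a single linear decoder head, and iterative refinement of the flow estimate) can be sketched as follows. This is a minimal illustrative toy, not the authors' implementation: the backbone is a stand-in function, and all names, shapes, and the residual-update scheme are assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

N_PATCHES, FEAT_DIM = 16, 8  # assumed token-grid size and feature width

def backbone(frames, flow_hint):
    """Stand-in for a pretrained video transformer: returns one feature
    vector per patch token. A real model would apply spatiotemporal
    self-attention across both frames; here we just mix the frame
    features with the current flow estimate for illustration."""
    return frames + 0.1 * flow_hint @ np.ones((2, FEAT_DIM))

# Lightweight linear decoder: patch features -> per-patch 2-D flow (dx, dy).
W = rng.standard_normal((FEAT_DIM, 2)) * 0.01

def decode(features):
    return features @ W

def estimate_flow(frames, n_iters=4):
    """Iterative refinement: start from zero flow and repeatedly decode,
    feeding the current estimate back into the backbone each round."""
    flow = np.zeros((N_PATCHES, 2))
    for _ in range(n_iters):
        feats = backbone(frames, flow)
        flow = flow + decode(feats)  # residual update per iteration
    return flow

frames = rng.standard_normal((N_PATCHES, FEAT_DIM))
flow = estimate_flow(frames)
print(flow.shape)  # (16, 2)
```

The key design point the paper argues is that only the decoder (and refinement loop) is task-specific; the backbone's generic patch attention already carries the geometric signal.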
Problem

Research questions and friction points this paper is trying to address.

Can video transformers be fine-tuned for multi-view geometry tasks such as optical flow?
Can general-purpose pretrained models perform geometric reasoning with only minimal adaptation?
Can such a simple transfer recipe match state-of-the-art task-specific pipelines on optical flow and other geometric vision tasks?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuning video foundation models for multi-view geometry tasks.
Using general-purpose attention between patches for temporal and spatial geometric reasoning.
Appending a linear decoder, with iterative refinement, to the Transformer backbone to reach state-of-the-art results.
Authors
Huimin Wu, The Hong Kong University of Science and Technology
Kwang-Ting Cheng, The Hong Kong University of Science and Technology
Stephen Lin, Microsoft Research Asia
Zhirong Wu, Microsoft Research Asia