🤖 AI Summary
This work addresses 3D scene reconstruction from maritime aerial videos. We introduce MTReD, the first benchmark dataset tailored for marine environments—comprising 19 internet-sourced flyover videos featuring ships, islands, and coastlines—which supports joint evaluation of geometric consistency and visual fidelity. To overcome the insensitivity of standard metrics (e.g., LPIPS) to reconstruction completeness, we propose DiFPS (DinoV2 Features Perception Similarity), a perception-based similarity metric leveraging DINOv2 features. Furthermore, we identify a preprocessing strategy that jointly improves reprojection accuracy and perceptual quality. Our evaluation compares Structure-from-Motion (via COLMAP) against the recent monocular 3D reconstruction model MASt3R, using DINOv2 semantic features within a custom evaluation framework. Experiments show that MASt3R achieves superior perceptual scores over traditional methods but suffers from higher reprojection error; our preprocessing significantly improves the trade-off between these objectives. Both the code and the MTReD dataset are publicly released.
📝 Abstract
This work tackles 3D scene reconstruction from fly-over video perspectives in the maritime domain, with a specific emphasis on geometrically and visually sound reconstructions. Such reconstructions enable downstream tasks such as segmentation, navigation, and localization. To our knowledge, no dataset is available in this domain. We therefore propose a novel maritime 3D scene reconstruction benchmarking dataset, named MTReD (Maritime Three-Dimensional Reconstruction Dataset). MTReD comprises 19 fly-over videos curated from the Internet containing ships, islands, and coastlines. As the task targets both geometric consistency and visual completeness, evaluation uses two kinds of metrics: (1) reprojection error; and (2) perception-based metrics. We find that existing perception-based metrics, such as Learned Perceptual Image Patch Similarity (LPIPS), do not appropriately measure the completeness of a reconstructed image. Thus, we propose a novel semantic similarity metric utilizing DINOv2 features, coined DiFPS (DinoV2 Features Perception Similarity). We perform an initial evaluation on two baselines: (1) Structure-from-Motion (SfM) through COLMAP; and (2) the recent state-of-the-art MASt3R model. We find that scenes reconstructed by MASt3R have higher reprojection errors but superior perception-based metric scores. To this end, we explore several pre-processing methods and identify one that improves both reprojection error and the perception-based score. We envisage that our proposed MTReD will stimulate further research in these directions. The dataset and all code will be made available at https://github.com/RuiYiYong/MTReD.
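The abstract does not spell out the DiFPS formulation, only that it measures semantic similarity via DINOv2 features. As a rough illustration of how such a metric could behave, the sketch below assumes each image has already been summarized by a DINOv2 embedding (e.g., the CLS token) and scores a reconstruction as the mean cosine similarity between reconstructed and reference embeddings; the function name and design are hypothetical, not the authors' published definition.

```python
import numpy as np

def difps_sketch(feats_recon, feats_ref):
    """Hypothetical DINOv2-feature perception similarity sketch.

    Assumes each row is a precomputed DINOv2 embedding (e.g., CLS token)
    for one image. Scores reconstruction quality as the mean cosine
    similarity between reconstructed and reference embeddings: 1.0 for
    identical features, near 0.0 for unrelated ones. This is an
    illustrative stand-in, not the paper's actual DiFPS formula.
    """
    feats_recon = np.asarray(feats_recon, dtype=np.float64)
    feats_ref = np.asarray(feats_ref, dtype=np.float64)
    # L2-normalize each embedding so the dot product is cosine similarity.
    a = feats_recon / np.linalg.norm(feats_recon, axis=1, keepdims=True)
    b = feats_ref / np.linalg.norm(feats_ref, axis=1, keepdims=True)
    # Row-wise dot products, averaged over all image pairs.
    return float(np.mean(np.sum(a * b, axis=1)))
```

Unlike pixel-level metrics, a feature-space score like this is insensitive to small photometric shifts but penalizes missing semantic content, which is the failure mode of LPIPS the abstract describes.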