SEAR: Simple and Efficient Adaptation of Visual Geometric Transformers for RGB+Thermal 3D Reconstruction

📅 2026-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of ineffective cross-modal alignment in existing RGB-pretrained visual geometry models when applied to RGB-T (RGB-thermal) data, a mismatch that degrades 3D reconstruction and pose estimation. To overcome this limitation, the authors propose SEAR, a lightweight fine-tuning strategy that efficiently adapts large-scale RGB-pretrained visual geometry Transformers to RGB-T inputs, significantly improving multimodal alignment even with limited training data. The method remains robust under challenging conditions such as low illumination and heavy smoke, while incurring negligible additional inference overhead. SEAR substantially outperforms existing approaches across multiple metrics, achieving over a 29% improvement in AUC@30. Additionally, the authors introduce a new high-quality, multi-view, multi-temporal RGB-T dataset to support future research in this domain.
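The headline metric, AUC@30, is the standard pose-accuracy measure used by feed-forward geometry models: relative-pose angular errors are thresholded at every integer degree up to 30°, and the resulting accuracy curve is averaged. The paper does not spell out its exact AUC variant, so the sketch below uses the common integer-threshold formulation as an illustrative assumption:

```python
def pose_auc(errors_deg, max_threshold=30):
    """Area under the accuracy-vs-threshold curve for relative-pose errors.

    errors_deg: per-pair angular errors in degrees (typically the max of
    rotation and translation angle error).
    Averages the accuracy at integer thresholds 1..max_threshold degrees.
    This is a common formulation, not necessarily the paper's exact one.
    """
    n = len(errors_deg)
    accuracies = [
        sum(e < t for e in errors_deg) / n
        for t in range(1, max_threshold + 1)
    ]
    return sum(accuracies) / len(accuracies)


# A perfect estimator scores 1.0; errors beyond the threshold score 0.0.
print(pose_auc([0.5, 2.0, 10.0]))   # high AUC: all errors well under 30°
print(pose_auc([45.0, 90.0]))       # 0.0: no pair falls under any threshold
```

A "29% improvement in AUC@30" therefore means the averaged accuracy curve rises by that margin, i.e. many more image pairs fall under the tighter angular thresholds.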

📝 Abstract
Foundational feed-forward visual geometry models enable accurate and efficient camera pose estimation and scene reconstruction by learning strong scene priors from massive RGB datasets. However, their effectiveness drops when applied to mixed sensing modalities, such as RGB-thermal (RGB-T) images. We observe that while a visual geometry grounded transformer pretrained on RGB data generalizes well to thermal-only reconstruction, it struggles to align RGB and thermal modalities when processed jointly. To address this, we propose SEAR, a simple yet efficient fine-tuning strategy that adapts a pretrained geometry transformer to multimodal RGB-T inputs. Despite being trained on a relatively small RGB-T dataset, our approach significantly outperforms state-of-the-art methods for 3D reconstruction and camera pose estimation, achieving consistent gains across all metrics (e.g., over 29% in AUC@30) and delivering higher detail and consistency between modalities with negligible inference-time overhead compared to the original pretrained model. Notably, SEAR enables reliable multimodal pose estimation and reconstruction even under challenging conditions, such as low lighting and dense smoke. We validate our architecture through extensive ablation studies, demonstrating how the model aligns both modalities. Additionally, we introduce a new dataset featuring RGB and thermal sequences captured at different times, viewpoints, and illumination conditions, providing a robust benchmark for future work in multimodal 3D scene reconstruction. Code and models are publicly available at https://www.github.com/Schindler-EPFL-Lab/SEAR.
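The abstract describes adapting an RGB-pretrained geometry transformer to RGB-T inputs but does not detail the mechanism here. One common, lightweight way to feed a fourth (thermal) channel into an RGB-pretrained vision transformer is to "inflate" the patch-embedding weights: keep the three pretrained RGB input channels and initialize the new thermal channel from their mean, so the pretrained features remain usable from the first step of fine-tuning. The NumPy sketch below illustrates this generic technique; it is an assumption for exposition, not SEAR's actual adaptation method:

```python
import numpy as np


def inflate_patch_embed(w_rgb: np.ndarray) -> np.ndarray:
    """Extend a pretrained RGB patch-embedding weight to 4-channel RGB-T.

    w_rgb: convolutional patch-embedding weight of shape (D, 3, p, p),
    where D is the embedding dim and p the patch size.
    Returns a (D, 4, p, p) weight whose extra channel is the mean of the
    RGB channels -- a standard warm-start heuristic, assumed here for
    illustration rather than taken from the paper.
    """
    d, c, p, _ = w_rgb.shape
    assert c == 3, "expected an RGB-pretrained embedding"
    # Initialize the thermal channel so a grayscale-like thermal image
    # produces activations on the same scale as the RGB channels.
    w_thermal = w_rgb.mean(axis=1, keepdims=True)  # (D, 1, p, p)
    return np.concatenate([w_rgb, w_thermal], axis=1)  # (D, 4, p, p)


# Example: inflate a toy (8, 3, 16, 16) embedding to accept RGB-T patches.
w = np.random.randn(8, 3, 16, 16).astype(np.float32)
w_rgbt = inflate_patch_embed(w)
print(w_rgbt.shape)  # (8, 4, 16, 16)
```

After inflation, the rest of the transformer is untouched, which is consistent with the abstract's emphasis on negligible inference overhead relative to the original pretrained model.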
Problem

Research questions and friction points this paper is trying to address.

RGB-thermal
3D reconstruction
multimodal alignment
visual geometric transformers
camera pose estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal adaptation
RGB-thermal fusion
visual geometric transformer
efficient fine-tuning
3D reconstruction