🤖 AI Summary
This work addresses the limited generalization of existing homography estimation methods to unseen modalities. To overcome this, the authors propose a universal synthetic training data generation strategy that produces non-aligned multimodal image pairs with realistic displacements from a single input image, enabling model training without real multimodal correspondences. The generated pairs exhibit diverse textures and colors while preserving the structural information needed for alignment. Furthermore, a novel network architecture is introduced that disentangles color from structural features and integrates multiscale information for improved estimation accuracy. Extensive experiments demonstrate that the proposed method significantly outperforms state-of-the-art approaches across various unseen modalities, validating both the synthetic data strategy and the network design.
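The paper's code is not reproduced here, so as a rough illustration only: below is a minimal sketch of the widely used four-point perturbation scheme for building unaligned pairs with ground-truth corner offsets from a single image, with a crude appearance perturbation standing in for a modality change. The function names, the parameters `patch_size` and `max_offset`, and the channel-mixing/gamma perturbation are assumptions for illustration, not the authors' pipeline.

```python
import numpy as np
import cv2

def simulate_modality(patch):
    # Hypothetical stand-in for a modality change (not the paper's method):
    # random channel mixing plus a gamma shift alters texture and color
    # statistics while leaving edges and structure intact. Assumes an
    # RGB uint8 patch.
    mix = np.random.uniform(0.0, 1.0, (3, 3)).astype(np.float32)
    mix /= mix.sum(axis=1, keepdims=True)  # rows sum to 1, keeps range
    out = patch.astype(np.float32) @ mix.T
    gamma = np.random.uniform(0.5, 2.0)
    out = 255.0 * (out / 255.0) ** gamma
    return out.astype(np.uint8)

def synthesize_pair(image, patch_size=128, max_offset=32):
    # Sample a patch whose perturbed corners stay inside the image.
    h, w = image.shape[:2]
    x = np.random.randint(max_offset, w - patch_size - max_offset)
    y = np.random.randint(max_offset, h - patch_size - max_offset)
    corners = np.float32([[x, y], [x + patch_size, y],
                          [x + patch_size, y + patch_size],
                          [x, y + patch_size]])
    # Ground-truth supervision: random per-corner displacements.
    offsets = np.random.randint(-max_offset, max_offset + 1,
                                size=(4, 2)).astype(np.float32)
    # Homography mapping the perturbed quadrilateral back onto the patch,
    # so the cropped pair is misaligned by exactly `offsets`.
    H = cv2.getPerspectiveTransform(corners + offsets, corners)
    warped = cv2.warpPerspective(image, H, (w, h))
    patch_a = image[y:y + patch_size, x:x + patch_size]
    patch_b = simulate_modality(warped[y:y + patch_size, x:x + patch_size])
    return patch_a, patch_b, offsets
```

A network can then regress `offsets` from the concatenated patches, which is the standard supervision signal in synthetic homography pipelines of this kind.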
📝 Abstract
Supervised and unsupervised homography estimation methods depend on image pairs tailored to specific modalities to achieve high accuracy. However, their performance deteriorates substantially when applied to unseen modalities. To address this issue, we propose a training data synthesis method that generates unaligned image pairs with ground-truth offsets from a single input image. Our approach renders the image pairs with diverse textures and colors while preserving their structural information. Models trained on these synthetic data achieve greater robustness and improved generalization across domains. Additionally, we design a network that fully leverages cross-scale information and decouples color information from feature representations, thus improving estimation accuracy. Extensive experiments show that our training data synthesis method improves generalization performance and confirm the effectiveness of the proposed network.
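For intuition on what "decoupling color information from feature representations" can mean in practice, one common mechanism is instance normalization, which strips per-channel appearance statistics from activations and leaves structure-dominated features; the sketch below pairs it with a simple feature pyramid so a downstream estimator can fuse cross-scale cues. This is a hypothetical stand-in under those assumptions, not the paper's actual architecture, and the names `StructureEncoder`, `width`, and `num_scales` are invented for illustration.

```python
import torch
import torch.nn as nn

class StructureEncoder(nn.Module):
    # Hypothetical encoder: instance normalization removes per-channel
    # mean/variance, which carry much of the color/style information,
    # while the strided stages build a multiscale feature pyramid.
    def __init__(self, in_ch=3, width=32, num_scales=3):
        super().__init__()
        blocks, ch = [], in_ch
        for s in range(num_scales):
            out_ch = width << s  # 32, 64, 128, ...
            blocks.append(nn.Sequential(
                nn.Conv2d(ch, out_ch, 3, stride=2, padding=1),
                nn.InstanceNorm2d(out_ch, affine=False),  # drop color stats
                nn.ReLU(inplace=True)))
            ch = out_ch
        self.blocks = nn.ModuleList(blocks)

    def forward(self, x):
        # Return features at every scale for cross-scale fusion downstream.
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x)
        return feats

# Usage: a pyramid of color-suppressed features for a pair of inputs.
enc = StructureEncoder()
feats = enc(torch.randn(1, 3, 128, 128))  # 3 tensors at 1/2, 1/4, 1/8 scale
```

The design intuition is that appearance statistics differ wildly across modalities while scene structure does not, so suppressing the former before matching should transfer better to unseen modalities.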