Retrieval-guided Cross-view Image Synthesis

📅 2024-11-29
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of semantic misalignment and low generation quality in cross-view image synthesis under large viewpoint disparities, this paper proposes a retrieval-guided end-to-end framework. First, a smooth and discriminative cross-view embedding space is constructed via contrastive learning. Second, a joint modeling mechanism for viewpoint-invariant and viewpoint-specific features is designed to integrate cross-view feature retrieval with embedding-space-guided generative adversarial networks. Notably, deep information retrieval is incorporated into the synthesis pipeline for the first time—without requiring auxiliary annotations such as segmentation masks. To support complex urban scene modeling under wide-area viewpoint variation, we introduce VIGOR-GEN, the first benchmark dataset tailored for large-scale cross-view synthesis. Extensive experiments on CVUSA, CVACT, and VIGOR-GEN demonstrate state-of-the-art performance: top-1 retrieval accuracy (R@1) improves by up to 12.3%, and Fréchet Inception Distance (FID) decreases by up to 18.6% compared to prior methods.
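The summary's first step, building a smooth and discriminative cross-view embedding space via contrastive learning, is commonly realized with a symmetric InfoNCE objective over paired ground/aerial embeddings. The sketch below illustrates that idea under stated assumptions; the function name, temperature value, and batch-diagonal positive assignment are illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def infonce_cross_view(ground_emb, aerial_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired cross-view embeddings.

    ground_emb, aerial_emb: (B, D) tensors where row i of each tensor
    encodes the same scene from two viewpoints. Matching pairs are pulled
    together and all in-batch mismatches pushed apart, which is what yields
    a smooth, discriminative embedding space usable for retrieval (R@1).
    """
    g = F.normalize(ground_emb, dim=1)
    a = F.normalize(aerial_emb, dim=1)
    logits = g @ a.t() / temperature       # (B, B) scaled cosine similarities
    targets = torch.arange(g.size(0))      # positives lie on the diagonal
    loss_g2a = F.cross_entropy(logits, targets)      # ground -> aerial
    loss_a2g = F.cross_entropy(logits.t(), targets)  # aerial -> ground
    return 0.5 * (loss_g2a + loss_a2g)
```

Averaging both retrieval directions keeps the embedding space symmetric, so the same space can serve both retrieval and synthesis guidance.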

📝 Abstract
Information retrieval techniques have demonstrated exceptional capabilities in identifying semantic similarities across diverse domains through robust feature representations. However, their potential in guiding synthesis tasks, particularly cross-view image synthesis, remains underexplored. Cross-view image synthesis presents significant challenges in establishing reliable correspondences between drastically different viewpoints. To address this, we propose a novel retrieval-guided framework that reimagines how retrieval techniques can facilitate effective cross-view image synthesis. Unlike existing methods that rely on auxiliary information, such as semantic segmentation maps or preprocessing modules, our retrieval-guided framework captures semantic similarities across different viewpoints, trained through contrastive learning to create a smooth embedding space. Furthermore, a novel fusion mechanism leverages these embeddings to guide image synthesis while learning and encoding both view-invariant and view-specific features. To further advance this area, we introduce VIGOR-GEN, a new urban-focused dataset with complex viewpoint variations in real-world scenarios. Extensive experiments demonstrate that our retrieval-guided approach significantly outperforms existing methods on the CVUSA, CVACT and VIGOR-GEN datasets, particularly in retrieval accuracy (R@1) and synthesis quality (FID). Our work bridges information retrieval and synthesis tasks, offering insights into how retrieval techniques can address complex cross-domain synthesis challenges.
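The abstract's fusion mechanism encodes both view-invariant and view-specific features from the retrieved embedding and uses them to guide synthesis. A minimal sketch of one way this conditioning could work is below, assuming a FiLM-style modulation of generator feature maps; the class name, the even embedding split, and the scale/shift roles are all assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ViewAwareFusion(nn.Module):
    """Illustrative fusion block: splits a cross-view embedding into a
    view-invariant half (shared scene content) and a view-specific half
    (viewpoint geometry/appearance), then modulates generator features
    with a FiLM-style scale and shift derived from each half.
    """
    def __init__(self, embed_dim=256, feat_channels=64):
        super().__init__()
        self.half = embed_dim // 2
        self.invariant_proj = nn.Linear(self.half, feat_channels)             # -> scale
        self.specific_proj = nn.Linear(embed_dim - self.half, feat_channels)  # -> shift

    def forward(self, feat, embedding):
        # feat: (B, C, H, W) generator features; embedding: (B, embed_dim)
        inv, spec = embedding[:, :self.half], embedding[:, self.half:]
        scale = self.invariant_proj(inv)[:, :, None, None]
        shift = self.specific_proj(spec)[:, :, None, None]
        return feat * (1 + scale) + shift  # condition features on both parts
```

Keeping the two halves in separate projections is one simple way to let the generator reuse scene content across viewpoints while still injecting viewpoint-dependent cues.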
Problem

Research questions and friction points this paper is trying to address.

Cross-view Image Synthesis
Information Retrieval
Perspective Variation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-view Image Synthesis
Information Retrieval Integration
View-aware Feature Optimization
Hongji Yang
Leicester University
Software Engineering · Creative Computing · Internet
Yiru Li
Shenzhen University, Shenzhen, China
Yingying Zhu
Shenzhen University, Shenzhen, China