Retrieval-guided Cross-view Image Synthesis

📅 2024-11-29
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of semantic misalignment and low generation quality in cross-view image synthesis under large viewpoint disparities, this paper proposes a retrieval-guided end-to-end framework. First, a smooth and discriminative cross-view embedding space is constructed via contrastive learning. Second, a joint modeling mechanism for viewpoint-invariant and viewpoint-specific features is designed to integrate cross-view feature retrieval with embedding-space-guided generative adversarial networks. Notably, deep information retrieval is incorporated into the synthesis pipeline for the first time—without requiring auxiliary annotations such as segmentation masks. To support complex urban scene modeling under wide-area viewpoint variation, we introduce VIGOR-GEN, the first benchmark dataset tailored for large-scale cross-view synthesis. Extensive experiments on CVUSA, CVACT, and VIGOR-GEN demonstrate state-of-the-art performance: top-1 retrieval accuracy (R@1) improves by up to 12.3%, and Fréchet Inception Distance (FID) decreases by up to 18.6% compared to prior methods.
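The summary's first step, building a smooth and discriminative cross-view embedding space via contrastive learning, is commonly realized with a symmetric InfoNCE objective over paired ground/aerial embeddings. The sketch below illustrates that idea under stated assumptions; the function name, temperature value, and batch-diagonal positive assignment are illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def infonce_cross_view(ground_emb, aerial_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired cross-view embeddings.

    ground_emb, aerial_emb: (B, D) tensors where row i of each tensor
    encodes the same scene from two viewpoints. Matching pairs are pulled
    together and all in-batch mismatches pushed apart, which is what yields
    a smooth, discriminative embedding space usable for retrieval (R@1).
    """
    g = F.normalize(ground_emb, dim=1)
    a = F.normalize(aerial_emb, dim=1)
    logits = g @ a.t() / temperature       # (B, B) scaled cosine similarities
    targets = torch.arange(g.size(0))      # positives lie on the diagonal
    loss_g2a = F.cross_entropy(logits, targets)      # ground -> aerial
    loss_a2g = F.cross_entropy(logits.t(), targets)  # aerial -> ground
    return 0.5 * (loss_g2a + loss_a2g)
```

Averaging both retrieval directions keeps the embedding space symmetric, so the same space can serve both retrieval and synthesis guidance.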

📝 Abstract
Information retrieval techniques have demonstrated exceptional capabilities in identifying semantic similarities across diverse domains through robust feature representations. However, their potential in guiding synthesis tasks, particularly cross-view image synthesis, remains underexplored. Cross-view image synthesis presents significant challenges in establishing reliable correspondences between drastically different viewpoints. To address this, we propose a novel retrieval-guided framework that reimagines how retrieval techniques can facilitate effective cross-view image synthesis. Unlike existing methods that rely on auxiliary information, such as semantic segmentation maps or preprocessing modules, our retrieval-guided framework captures semantic similarities across different viewpoints, trained through contrastive learning to create a smooth embedding space. Furthermore, a novel fusion mechanism leverages these embeddings to guide image synthesis while learning and encoding both view-invariant and view-specific features. To further advance this area, we introduce VIGOR-GEN, a new urban-focused dataset with complex viewpoint variations in real-world scenarios. Extensive experiments demonstrate that our retrieval-guided approach significantly outperforms existing methods on the CVUSA, CVACT and VIGOR-GEN datasets, particularly in retrieval accuracy (R@1) and synthesis quality (FID). Our work bridges information retrieval and synthesis tasks, offering insights into how retrieval techniques can address complex cross-domain synthesis challenges.
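The abstract's fusion mechanism encodes both view-invariant and view-specific features from the retrieved embedding and uses them to guide synthesis. A minimal sketch of one way this conditioning could work is below, assuming a FiLM-style modulation of generator feature maps; the class name, the even embedding split, and the scale/shift roles are all assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ViewAwareFusion(nn.Module):
    """Illustrative fusion block: splits a cross-view embedding into a
    view-invariant half (shared scene content) and a view-specific half
    (viewpoint geometry/appearance), then modulates generator features
    with a FiLM-style scale and shift derived from each half.
    """
    def __init__(self, embed_dim=256, feat_channels=64):
        super().__init__()
        self.half = embed_dim // 2
        self.invariant_proj = nn.Linear(self.half, feat_channels)             # -> scale
        self.specific_proj = nn.Linear(embed_dim - self.half, feat_channels)  # -> shift

    def forward(self, feat, embedding):
        # feat: (B, C, H, W) generator features; embedding: (B, embed_dim)
        inv, spec = embedding[:, :self.half], embedding[:, self.half:]
        scale = self.invariant_proj(inv)[:, :, None, None]
        shift = self.specific_proj(spec)[:, :, None, None]
        return feat * (1 + scale) + shift  # condition features on both parts
```

Keeping the two halves in separate projections is one simple way to let the generator reuse scene content across viewpoints while still injecting viewpoint-dependent cues.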
Problem

Research questions and friction points this paper is trying to address.

Cross-view Image Synthesis
Information Retrieval
Perspective Variation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-view Image Synthesis
Information Retrieval Integration
View-aware Feature Optimization
Hongji Yang
Leicester University
Software Engineering · Creative Computing · Internet
Yiru Li
Shenzhen University, Shenzhen, China
Yingying Zhu
Shenzhen University, Shenzhen, China