🤖 AI Summary
The scarcity of large-baseline stereo image data hinders the training of dedicated diffusion models for text-driven stereo generation. To address this, we pioneer the adaptation of Stable Diffusion to stereo image synthesis, introducing a stereo consistency reward function that jointly optimizes disparity-based geometric consistency and text-image alignment. Our method integrates efficient LoRA-based fine-tuning, reinforcement learning with reward modeling, a prompt alignment loss, and dual-view consistency constraints. Extensive experiments show that our approach generates high-fidelity, geometrically coherent stereo images across diverse scenes, with quantitative and qualitative evaluations demonstrating significant improvements over existing text-to-stereo methods. Moreover, the framework exhibits strong zero-shot generalization without task-specific retraining.
📝 Abstract
In this paper, we propose a novel diffusion-based approach that generates stereo images from a text prompt. Because stereo image datasets with large baselines are scarce, training a diffusion model from scratch is infeasible. We therefore leverage the strong priors learned by Stable Diffusion, fine-tuning it on stereo image datasets to adapt it to stereo generation. To improve stereo consistency and text-to-image alignment, we further tune the model using prompt alignment and our proposed stereo consistency reward functions. Comprehensive experiments demonstrate the superiority of our approach in generating high-quality stereo images across diverse scenarios, outperforming existing methods.
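The stereo consistency reward described above can be illustrated with a minimal sketch. This is an assumption-laden toy version, not the paper's implementation: it warps the right view to the left using a disparity map, scores photometric agreement, and mixes in a placeholder `align_score` (standing in for a text-image alignment score, e.g. from a pretrained vision-language model) with a hypothetical weight `lam`.

```python
import numpy as np

def warp_right_to_left(right, disparity):
    # For each left-view pixel (y, x), sample the right view at (y, x - d),
    # following the usual convention x_right = x_left - disparity.
    h, w = right.shape[:2]
    xs = np.tile(np.arange(w), (h, 1))
    ys = np.tile(np.arange(h)[:, None], (1, w))
    src = np.clip(xs - np.round(disparity).astype(int), 0, w - 1)
    return right[ys, src]

def stereo_consistency_reward(left, right, disparity):
    # Photometric error between the left view and the disparity-warped right
    # view, mapped to a reward in (0, 1]: lower error -> higher reward.
    warped = warp_right_to_left(right, disparity)
    err = np.mean(np.abs(left - warped))
    return float(np.exp(-err))

def combined_reward(left, right, disparity, align_score, lam=0.5):
    # Hypothetical weighted sum of geometric consistency and a text-image
    # alignment score supplied by an external model (here just a scalar).
    return lam * stereo_consistency_reward(left, right, disparity) \
        + (1.0 - lam) * align_score
```

For a perfectly consistent pair (the left view exactly equals the warped right view), the geometric term attains its maximum of 1, and the combined reward reduces to `lam + (1 - lam) * align_score`.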