Learning Visual Generative Priors without Text

📅 2024-12-10
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-image (T2I) models rely heavily on large-scale, high-quality text-image pairs, making annotation and scaling expensive. Method: This paper proposes Lumos, a text-free, vision-only generative prior: a self-supervised image-to-image (I2I) pretraining framework built on a pure-vision diffusion architecture that learns from unlabeled in-the-wild images alone, on the premise that a sound visual prior should focus on texture modeling rather than cross-modal alignment. Contribution/Results: The pretrained I2I model serves as a more foundational and scalable upstream prior than T2I; it supports lightweight text-conditioned fine-tuning using only 1/10 of the paired data and transfers to text-irrelevant downstream tasks such as image-to-3D and image-to-video. Experiments show Lumos matches or surpasses state-of-the-art T2I models on these benchmarks while sharply reducing dependence on annotated data.

📝 Abstract
Although text-to-image (T2I) models have recently thrived as visual generative priors, their reliance on high-quality text-image pairs makes scaling up expensive. We argue that grasping the cross-modality alignment is not a necessity for a sound visual generative prior, whose focus should be on texture modeling. Such a philosophy inspires us to study image-to-image (I2I) generation, where models can learn from in-the-wild images in a self-supervised manner. We first develop a pure vision-based training framework, Lumos, and confirm the feasibility and the scalability of learning I2I models. We then find that, as an upstream task of T2I, our I2I model serves as a more foundational visual prior and achieves on-par or better performance than existing T2I models using only 1/10 text-image pairs for fine-tuning. We further demonstrate the superiority of I2I priors over T2I priors on some text-irrelevant visual generative tasks, like image-to-3D and image-to-video. Our project page is available at https://ant-research.github.io/lumos.
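The abstract's core idea is that supervision can come from images alone: a corrupted copy of an image serves as the model input and the original as the target, so no text annotation is needed. The paper's actual Lumos architecture is a diffusion model and is not detailed here; the sketch below only illustrates the self-supervised I2I pairing idea on toy pixel lists, and the names `make_i2i_pair` and `training_stream` are hypothetical, not from the paper.

```python
import random

def make_i2i_pair(image, noise_std=0.1, rng=None):
    """Build a self-supervised I2I training pair from one unlabeled image:
    the corrupted copy is the model input, the original is the target.
    Supervision comes from the image itself, never from a caption."""
    rng = rng or random.Random(0)
    corrupted = [min(1.0, max(0.0, px + rng.gauss(0.0, noise_std)))
                 for px in image]  # additive Gaussian noise, clamped to [0, 1]
    return corrupted, image

def training_stream(images, rng=None):
    """Yield (input, target) pairs from in-the-wild images only."""
    for img in images:
        yield make_i2i_pair(img, rng=rng)

# Toy usage: two "images" as flat pixel lists in [0, 1].
pairs = list(training_stream([[0.2, 0.5, 0.9], [0.1, 0.4, 0.7]]))
```

In this framing, text-conditioned generation is a downstream fine-tuning step layered on top of the I2I prior, which is why the abstract reports needing only a fraction of the text-image pairs.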
Problem

Research questions and friction points this paper is trying to address.

Learning visual generative priors without text dependency
Exploring image-to-image generation for texture modeling
Comparing I2I and T2I models for visual generative tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pure vision-based training framework Lumos
Self-supervised learning from in-the-wild images
Image-to-image as foundational visual prior