Camera Control for Text-to-Image Generation via Learning Viewpoint Tokens

📅 2026-04-21

🤖 AI Summary
Existing text-to-image models struggle to control camera viewpoint precisely through natural language alone. This work proposes parametric viewpoint tokens for viewpoint-conditioned image synthesis: image generation models are fine-tuned on a curated dataset that combines 3D-rendered images, which provide geometric supervision, with photorealistic augmentations, which add appearance and background diversity. The learned viewpoint representation factorizes geometry from appearance, transfers to unseen object categories, and gives the text-vision latent space explicit 3D camera structure. Qualitative and quantitative experiments show state-of-the-art viewpoint accuracy while preserving image quality and prompt fidelity.

📝 Abstract
Current text-to-image models struggle to provide precise camera control using natural language alone. In this work, we present a framework for precise camera control with global scene understanding in text-to-image generation by learning parametric camera tokens. We fine-tune image generation models for viewpoint-conditioned text-to-image generation on a curated dataset that combines 3D-rendered images for geometric supervision and photorealistic augmentations for appearance and background diversity. Qualitative and quantitative experiments demonstrate that our method achieves state-of-the-art accuracy while preserving image quality and prompt fidelity. Unlike prior methods that overfit to object-specific appearance correlations, our viewpoint tokens learn factorized geometric representations that transfer to unseen object categories. Our work shows that text-vision latent spaces can be endowed with explicit 3D camera structure, offering a pathway toward geometrically-aware prompts for text-to-image generation. Project page: https://randdl.github.io/viewtoken_control/
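
To make the idea of a parametric camera token concrete, here is a minimal sketch of one way camera parameters could be turned into a single embedding and appended to a prompt's text embeddings. Everything in it is an assumption for illustration and is not taken from the paper: the azimuth/elevation parameterization, the sinusoidal features, the single linear projection, and the names `camera_features`, `ViewpointToken`, and `condition_prompt` are all hypothetical. The weights are random here; in an actual system they would be learned during fine-tuning.

```python
import numpy as np


def camera_features(azimuth_deg, elevation_deg, num_freqs=4):
    """Sinusoidal features of two camera angles (hypothetical parameterization)."""
    angles = np.deg2rad(np.array([azimuth_deg, elevation_deg], dtype=np.float64))
    freqs = 2.0 ** np.arange(num_freqs)          # 1, 2, 4, 8 for num_freqs=4
    scaled = angles[:, None] * freqs[None, :]    # shape (2, num_freqs)
    # Integer frequencies make the features periodic in 360 degrees.
    return np.concatenate([np.sin(scaled).ravel(), np.cos(scaled).ravel()])


class ViewpointToken:
    """Maps camera parameters to one embedding vector (a 'token')."""

    def __init__(self, embed_dim, num_freqs=4, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = 4 * num_freqs                   # sin and cos for two angles
        self.num_freqs = num_freqs
        # Assumption: random stand-ins for weights that would be learned.
        self.W = rng.normal(scale=in_dim ** -0.5, size=(in_dim, embed_dim))
        self.b = np.zeros(embed_dim)

    def __call__(self, azimuth_deg, elevation_deg):
        feats = camera_features(azimuth_deg, elevation_deg, self.num_freqs)
        return feats @ self.W + self.b


def condition_prompt(text_embeddings, token):
    """Append the viewpoint token to a (seq_len, embed_dim) prompt embedding."""
    return np.concatenate([text_embeddings, token[None, :]], axis=0)


embed_dim = 8
text_embeddings = np.zeros((5, embed_dim))       # stand-in for text-encoder output
view = ViewpointToken(embed_dim)
conditioned = condition_prompt(
    text_embeddings, view(azimuth_deg=45.0, elevation_deg=30.0)
)
print(conditioned.shape)  # (6, 8)
```

Because the token is a function of continuous camera parameters rather than of any object-specific word, the same mapping can in principle be reused across object categories, which is the property the abstract attributes to factorized viewpoint tokens.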

Problem

Research questions and friction points this paper is trying to address.

camera control
text-to-image generation
viewpoint tokens
3D camera structure
geometric representation

Innovation

Methods, ideas, or system contributions that make the work stand out.

camera control
viewpoint tokens
text-to-image generation
3D-aware generation
factorized representation