🤖 AI Summary
Existing text-to-image (T2I) methods lack precise control over camera pose and intrinsic parameters, limiting compositional expressiveness. This paper introduces the first end-to-end controllable T2I framework that enables photographic-grade composition control using only four fundamental camera parameters—pitch, yaw, focal length, and translation—without requiring 3D geometry, reference objects, or multi-view supervision. Our key contributions are: (1) the first large-scale dataset of 57K image-text pairs annotated with ground-truth camera calibration parameters; (2) a differentiable camera parameter encoding scheme and a parameterized diffusion embedding mechanism; and (3) synthetic data augmentation and a customized training strategy. Experiments demonstrate a 62% reduction in camera parameter prediction error and significantly better control across viewing angles, focal lengths, and translations than prompt-engineering baselines. All code, models, and data are publicly released.
📝 Abstract
Images as an artistic medium often rely on specific camera angles and lens distortions to convey ideas or emotions; however, such precise control is missing in current text-to-image models. We propose an efficient and general solution that allows precise control over the camera when generating both photographic and artistic images. Unlike prior methods that rely on predefined shots, we use only four simple extrinsic and intrinsic camera parameters, removing the need for pre-existing geometry, reference 3D objects, or multi-view data. We also present a novel dataset with more than 57,000 images, along with their text prompts and ground-truth camera parameters. Our evaluation shows precise camera control in text-to-image generation, surpassing traditional prompt-engineering approaches. Our data, model, and code are publicly available at https://graphics.unizar.es/projects/PreciseCam2024.
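The summary mentions a differentiable camera parameter encoding that feeds a diffusion embedding, but does not spell it out. A common way to condition a diffusion model on a handful of scalars is to expand each one into sinusoidal (Fourier) features before projecting them into the model's embedding space. The sketch below is purely illustrative and is not the paper's actual mechanism: the function names, band count, and the assumption that parameters are pre-normalized to roughly [-1, 1] are all ours.

```python
import math

def fourier_features(x, num_bands=8):
    """Encode one scalar with sin/cos features at geometrically spaced frequencies."""
    feats = []
    for k in range(num_bands):
        freq = 2.0 ** k
        feats.append(math.sin(freq * x))
        feats.append(math.cos(freq * x))
    return feats

def encode_camera(pitch, yaw, focal_length, translation, num_bands=8):
    """Concatenate per-parameter Fourier features into one conditioning vector.

    Hypothetical sketch: parameters are assumed pre-normalized to ~[-1, 1];
    the real system's encoding and embedding dimensions are not specified here.
    """
    vec = []
    for p in (pitch, yaw, focal_length, translation):
        vec.extend(fourier_features(p, num_bands))
    return vec

# 4 parameters x 8 bands x 2 (sin, cos) = 64-dimensional conditioning vector
emb = encode_camera(pitch=0.2, yaw=-0.5, focal_length=0.8, translation=0.0)
print(len(emb))  # → 64
```

In a full pipeline such a vector would typically pass through a small MLP and be added to (or cross-attended with) the diffusion model's timestep/text embeddings; the multi-frequency expansion gives the network a smoother, higher-capacity view of each scalar than feeding it in raw.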