🤖 AI Summary
Existing text-to-image generation models struggle to precisely control non-content imaging factors such as camera lens, sensor characteristics, viewpoint, and scene domain, which limits accurate synthesis of specific styles and objects. This work proposes MULTI, presented as the first approach to treat the disentanglement of imaging factors as an independent research direction. MULTI uses a two-stage learning framework based on textual inversion: the first stage learns general imaging factors and the second extracts dataset-specific ones. The learned factors can be modified individually, combined in novel ways, and used for image-to-image generation via ControlNets. Evaluated on the newly introduced DF-RICO benchmark, MULTI improves controllability over imaging factors and supports extending existing datasets to reduce distribution gaps.
📝 Abstract
Recent text-to-image models produce high-quality images, yet the ambiguity of text hinders precise control when specific styles or objects are required. Several recent works address learning and composing multiple objects and patterns. However, existing work focuses almost entirely on image content and overlooks imaging factors such as camera lens, sensor type, imaging viewpoint, and the domain characteristics of a scene. We introduce this new challenge as Imaging Factor Disentanglement and show the limitations of current approaches in this regime. We therefore propose a new method, Multi-factor disentanglement through Textual Inversion (MULTI). It consists of two stages: in the first stage, we learn general factors, and in the second stage, we extract dataset-specific ones. This setup enables both the extension of existing datasets and novel factor combinations, thereby reducing distribution gaps. It further supports the modification of specific factors and image-to-image generation via ControlNets. The evaluation on our new DF-RICO benchmark demonstrates the effectiveness of MULTI and highlights the importance of Factor Disentanglement as a new direction of research.