AI Summary
This work addresses the semantic gap between user intent and generated images that arises from the limited expressiveness of text prompts. The authors propose MultiBO (Multi-Choice Preferential Bayesian Optimization), a human-in-the-loop approach that combines multi-choice preference feedback with Bayesian optimization. By iteratively refining diffusion model outputs over rounds of user feedback, MultiBO guides image generation toward the user's mental target without relying on further textual description. Qualitative scores from 30 users, together with quantitative metrics compared against five baselines, show promising results and suggest closer alignment between generated images and user intent.
Abstract
Imagine Alice has a specific image $x^\ast$ in her mind, say, the view of the street on which she grew up. To generate that exact image, she guides a generative model with multiple rounds of prompting and arrives at an image $x^{p*}$. Although $x^{p*}$ is reasonably close to $x^\ast$, Alice finds it difficult to close the remaining gap using language prompts. This paper aims to narrow this gap by observing that even after language has reached its limits, humans can still tell when a new image $x^+$ is closer to $x^\ast$ than $x^{p*}$. Leveraging this observation, we develop MultiBO (Multi-Choice Preferential Bayesian Optimization), which carefully generates $K$ new images as a function of $x^{p*}$, collects preferential feedback from the user, uses the feedback to guide the diffusion model, and then generates a new set of $K$ images. We show that within $B$ rounds of user feedback, it is possible to arrive much closer to $x^\ast$, even though the generative model has no information about $x^\ast$. Qualitative scores from $30$ users, combined with quantitative metrics compared across $5$ baselines, show promising results, suggesting that multi-choice feedback from humans can be effectively harnessed for personalized image generation.
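The feedback loop described above can be illustrated with a toy sketch. This is not the MultiBO algorithm: the Bayesian-optimization surrogate and the diffusion model are replaced by a simple preference-guided random search over plain vectors, and the user is simulated by a function that picks the candidate nearest a hidden target. All names here (`simulated_user_choice`, `propose_candidates`, `preference_search`) and the parameters `step` and `seed` are hypothetical; only $K$ (`k`) and $B$ (`rounds`) correspond to symbols in the abstract.

```python
import math
import random


def simulated_user_choice(candidates, target):
    """Return the index of the candidate closest to the hidden target.

    Stands in for the human: the search loop only ever sees this index,
    never the target itself.
    """
    distances = [math.dist(c, target) for c in candidates]
    return distances.index(min(distances))


def propose_candidates(center, count, step, rng):
    """Sample `count` Gaussian perturbations of `center` (assumed proposal rule)."""
    return [[v + rng.gauss(0.0, step) for v in center] for _ in range(count)]


def preference_search(start, target, k=4, rounds=20, step=0.5, seed=0):
    """Run `rounds` rounds of K-way choice feedback, starting from `start`."""
    rng = random.Random(seed)
    best = list(start)
    # `history` is for evaluation only; a real system cannot measure this distance.
    history = [math.dist(best, target)]
    for _ in range(rounds):
        # Keep the incumbent among the K options so a round can never make things worse.
        candidates = [best] + propose_candidates(best, k - 1, step, rng)
        best = candidates[simulated_user_choice(candidates, target)]
        history.append(math.dist(best, target))
    return best, history


if __name__ == "__main__":
    best, history = preference_search([0.0, 0.0, 0.0], [1.0, -2.0, 0.5], k=4, rounds=30)
    print(f"start distance: {history[0]:.3f}, final distance: {history[-1]:.3f}")
```

Because the incumbent is always one of the $K$ options, the distance to the target in this sketch is non-increasing across rounds. A preferential Bayesian optimization method would instead fit a surrogate model to the accumulated choices and use it to decide which candidates to show next, which is where the sample efficiency claimed in the abstract would come from.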