🤖 AI Summary
Generating images of long-tail, fine-grained multi-entity interactions remains difficult, largely because rare interaction patterns are scarce in training data and models lack the expressivity to render them faithfully. To address this, the paper introduces InterActing, an interaction-focused prompt dataset of 1,000 curated samples, and proposes DetailScribe, a framework that (i) uses an LLM to semantically decompose interaction concepts into finer-grained components, (ii) uses a vision-language model (VLM) to critique intermediate diffusion outputs, and (iii) applies targeted interventions within the diffusion process. Built on Stable Diffusion 3.5, this decompose-critique-intervene refinement loop improves accuracy and fidelity across functional actions, spatial configurations, and multi-subject interactions, with both automatic metrics and human evaluations showing significant gains in image quality.
📝 Abstract
Images not only depict objects but also encapsulate rich interactions between them. However, generating faithful, high-fidelity images of multiple entities interacting with each other is a long-standing challenge. While pre-trained text-to-image models are trained on large-scale datasets to follow diverse text instructions, they struggle to generate accurate interactions, likely due to the scarcity of training data for uncommon object interactions. This paper introduces InterActing, an interaction-focused dataset with 1,000 fine-grained prompts covering three key scenarios: (1) functional and action-based interactions, (2) compositional spatial relationships, and (3) multi-subject interactions. To address interaction generation challenges, we propose a decomposition-augmented refinement procedure. Our approach, DetailScribe, built on Stable Diffusion 3.5, leverages LLMs to decompose interactions into finer-grained concepts, uses a VLM to critique generated images, and applies targeted interventions within the diffusion process during refinement. Automatic and human evaluations show significantly improved image quality, demonstrating the potential of enhanced inference strategies. Our dataset and code are available at https://concepts-ai.com/p/detailscribe/ to facilitate future exploration of interaction-rich image generation.
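The refinement procedure described above (LLM decomposition, VLM critique, targeted re-intervention) can be sketched as a simple control loop. This is a hypothetical illustration, not the authors' code: `decompose`, `generate`, and `critique` are stand-in stubs for the LLM, the Stable Diffusion 3.5 call, and the VLM, and their names and signatures are assumptions.

```python
# Hedged sketch of a decompose-critique-intervene refinement loop.
# All functions below are illustrative stubs, not DetailScribe's real API.

def decompose(prompt):
    """Stub LLM: split an interaction prompt into finer-grained concepts."""
    return [c.strip() for c in prompt.split(" and ")]

def generate(prompt, concepts, fixes=()):
    """Stub diffusion call: returns a mock 'image' record instead of pixels."""
    return {"prompt": prompt, "concepts": list(concepts), "fixes": list(fixes)}

def critique(image, concepts):
    """Stub VLM: report which decomposed concepts the image fails to realize."""
    return [c for c in concepts if c not in image["fixes"]]

def detail_scribe(prompt, max_rounds=3):
    """Generate, critique against the decomposed concepts, and re-intervene."""
    concepts = decompose(prompt)
    image = generate(prompt, concepts)
    for _ in range(max_rounds):
        errors = critique(image, concepts)
        if not errors:
            break  # VLM finds no remaining concept violations
        # Targeted intervention: regenerate while steering toward failed concepts
        image = generate(prompt, concepts, fixes=image["fixes"] + errors)
    return image
```

In a real system the intervention step would edit features inside the diffusion trajectory rather than regenerate from scratch; the loop structure is the part the abstract specifies.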