Generating an Image From 1,000 Words: Enhancing Text-to-Image With Structured Captions

📅 2025-11-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current text-to-image models rely on short prompts to generate complex images, resulting in a substantial semantic gap and poor controllability that limit their applicability in professional settings requiring fine-grained control. To address this, the paper trains FIBO, the first open-source text-to-image model built on fine-grained, structured long captions, and introduces: (1) DimFusion, a lightweight intermediate-token fusion mechanism that efficiently aligns linguistic representations with visual generation without increasing token length; and (2) Text-as-a-Bottleneck Reconstruction (TaBR), an evaluation protocol for jointly assessing long-text control fidelity and image reconstructability. Experiments demonstrate that FIBO achieves state-of-the-art prompt alignment among open-source models, significantly improving generation quality, attribute disentanglement, and controllability under long-text guidance.
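For intuition, a structured long caption annotates every training sample with the same fixed set of attribute fields rather than free-form text. The sketch below is illustrative only; the field names are hypothetical placeholders, not the attribute schema actually used by FIBO:

```python
# Illustrative only: a structured long caption with a fixed attribute schema
# shared by every training sample. These field names are hypothetical; the
# paper's actual attribute set is not listed on this page.
structured_caption = {
    "subject": "a red vintage bicycle leaning against a brick wall",
    "background": "a narrow cobblestone street at dusk",
    "lighting": "warm, low-angle sunlight casting long shadows",
    "camera": "35mm lens, eye level, shallow depth of field",
    "style": "photorealistic",
    "composition": "subject off-center, following the rule of thirds",
}
```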

📝 Abstract
Text-to-image models have rapidly evolved from casual creative tools to professional-grade systems, achieving unprecedented levels of image quality and realism. Yet, most models are trained to map short prompts into detailed images, creating a gap between sparse textual input and rich visual outputs. This mismatch reduces controllability, as models often fill in missing details arbitrarily, biasing toward average user preferences and limiting precision for professional use. We address this limitation by training the first open-source text-to-image model on long structured captions, where every training sample is annotated with the same set of fine-grained attributes. This design maximizes expressive coverage and enables disentangled control over visual factors. To process long captions efficiently, we propose DimFusion, a fusion mechanism that integrates intermediate tokens from a lightweight LLM without increasing token length. We also introduce the Text-as-a-Bottleneck Reconstruction (TaBR) evaluation protocol. By assessing how well real images can be reconstructed through a captioning-generation loop, TaBR directly measures controllability and expressiveness, even for very long captions where existing evaluation methods fail. Finally, we demonstrate our contributions by training the large-scale model FIBO, achieving state-of-the-art prompt alignment among open-source models. Model weights are publicly available at https://huggingface.co/briaai/FIBO
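The abstract states only that DimFusion integrates intermediate tokens from a lightweight LLM without increasing token length. As a rough, hedged sketch of that idea (not the paper's actual architecture), one could tap hidden states from several intermediate LLM layers, project them to the generator's conditioning dimension, and mix them per position so the conditioning sequence keeps the caption's original length. All names, shapes, and the mixing scheme below are assumptions:

```python
# Hypothetical sketch of intermediate-layer token fusion. This is NOT the
# paper's DimFusion implementation; tapped layers, shapes, and the mixing
# scheme are assumptions made for illustration.
import torch
import torch.nn as nn


class IntermediateTokenFusion(nn.Module):
    """Fuse hidden states from several intermediate LLM layers into one
    conditioning sequence with the same token length as the caption."""

    def __init__(self, num_layers: int, llm_dim: int, cond_dim: int):
        super().__init__()
        # One projection per tapped LLM layer: llm_dim -> cond_dim.
        self.proj = nn.ModuleList(
            [nn.Linear(llm_dim, cond_dim) for _ in range(num_layers)]
        )
        # Learned per-layer mixing weights, softmax-normalized at fusion time.
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states: list[torch.Tensor]) -> torch.Tensor:
        # hidden_states: one [batch, seq_len, llm_dim] tensor per tapped layer.
        projected = torch.stack(
            [proj(h) for proj, h in zip(self.proj, hidden_states)], dim=0
        )  # [num_layers, batch, seq_len, cond_dim]
        weights = torch.softmax(self.layer_weights, dim=0).view(-1, 1, 1, 1)
        # Weighted sum over layers: the token count (seq_len) is unchanged.
        return (weights * projected).sum(dim=0)  # [batch, seq_len, cond_dim]
```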
Problem

Research questions and friction points this paper is trying to address.

How to train text-to-image models on long structured captions so that missing visual details are specified rather than filled in arbitrarily
How to process very long captions efficiently, without inflating the token sequence fed to the image generator
How to evaluate controllability and expressiveness for very long captions, where existing evaluation methods fail
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training on long structured captions, with every sample annotated with the same set of fine-grained attributes
DimFusion, which fuses intermediate tokens from a lightweight LLM without increasing token length
The TaBR protocol, which evaluates controllability via a captioning-generation reconstruction loop (sketched below)
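A minimal sketch of how a TaBR-style reconstruction loop could be scored, assuming some captioning model, image generator, and image-similarity metric are available. The concrete models and metric used in the paper are not specified on this page, so all three callables below are placeholders:

```python
# Hedged sketch of a Text-as-a-Bottleneck Reconstruction (TaBR)-style loop.
# The captioner, generator, and similarity metric are placeholders; the
# paper's protocol may use different models and scoring.
from typing import Callable, Sequence


def tabr_score(
    images: Sequence,          # real reference images
    caption_fn: Callable,      # image -> long structured caption (text)
    generate_fn: Callable,     # caption -> reconstructed image
    similarity_fn: Callable,   # (real, reconstructed) -> similarity in [0, 1]
) -> float:
    """Average reconstruction similarity when text is the only bottleneck."""
    scores = []
    for real in images:
        caption = caption_fn(real)            # describe the real image in text
        reconstructed = generate_fn(caption)  # regenerate from the text alone
        scores.append(similarity_fn(real, reconstructed))
    return sum(scores) / len(scores)
```

Higher scores indicate that the captions carry enough information, and the generator follows them faithfully enough, for real images to be reconstructed through text alone.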
🔎 Similar Papers
No similar papers found.
Eyal Gutflaish
BRIA AI
Eliran Kachlon
BRIA AI
Hezi Zisman
BRIA AI
Tal Hacham
BRIA AI
Nimrod Sarid
BRIA AI
Alexander Visheratin
BRIA AI
Saar Huberman
Unknown affiliation
Computer Vision, Geometric Processing, Deep Learning
Gal Davidi
BRIA AI
Guy Bukchin
BRIA AI
Kfir Goldberg
BRIA AI
Ron Mokady
Tel Aviv University
Computer Vision, Deep Learning