Vision-Language Binding in In-Context Image Generation

📅 2026-05-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates how a reference image and a text prompt jointly shape the output of in-context image generation when all tokens share a single attention stream. Focusing on the multimodal DiT model FLUX.2, the authors apply three causal interventions (T2I Lens, Attention Knockout, and I2I-to-I2I Patching) and show that text tokens, particularly padding tokens, act as a structured channel for reference-image content. They also find that pixel-exact identity information bypasses the text tokens and flows directly from reference to output through image-to-image attention, which gives two distinct routes for reference information. Across 2,875 editing tasks, properties such as color, style, and scene setting are carried through the text tokens, whereas a specific face or instance identity relies on the image pathway, clarifying the division of labor between the two routes.

📝 Abstract

In-context image generation models such as FLUX.2 take a text prompt and an optional reference image as visual conditioning for the output. Internally, all three inputs -- text, reference image, and the noise tokens -- are concatenated and processed through a single attention stream, where all tokens can attend to one another. This leaves open how reference information flows through the model to produce the output image. We show that an implicit cross-modal binding emerges between the text tokens and the reference image: the text tokens absorb visual reference content during the forward pass, and that absorbed content causally influences the generated output. We surface this binding with three causal interventions on FLUX.2: T2I Lens, which decodes intermediate text-token activations through a text-to-image path; Attention Knockout, which severs specific attention edges; and I2I-to-I2I Patching, which copies text token activations between editing runs. Across 2,875 editing tasks on various images, including SUN397 and DreamBench++ datasets and images collected online, we observe a consistent division of labor: properties of the reference image, like color, style, and scene setting, are first written into the text tokens, which carry them to the generated image; pixel-exact properties like a specific face or instance identity bypass the text tokens and flow directly from reference to image through image-to-image attention. We further localize the reference-text binding to the padding tokens of the text sequence. These results show that text tokens in a multimodal DiT are not just prompt holders, but a structured channel for reference image content. More broadly, they suggest that even in unified-attention multimodal generative models, token modality structures how conditioning information is represented and routed across the network.
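To make the Attention Knockout idea concrete, here is a minimal sketch of severing attention edges in a single joint-attention layer over concatenated text, reference, and noise tokens. It is a toy illustration and not FLUX.2's implementation: the single-head NumPy attention, the token counts, and the names `joint_attention`, `knockout_mask`, and `segments` are all assumptions made for this example.

```python
import numpy as np

def joint_attention(q, k, v, blocked=None):
    """Single-head softmax attention over one concatenated token sequence.

    blocked[i, j] = True removes the edge "query i reads from key j".
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if blocked is not None:
        # A score of -inf becomes an attention weight of exactly 0.
        scores = np.where(blocked, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

def knockout_mask(n_tokens, segments, reader, source):
    """Block every edge on which a `reader` token attends to a `source` token."""
    blocked = np.zeros((n_tokens, n_tokens), dtype=bool)
    blocked[segments[reader], segments[source]] = True
    return blocked

# Hypothetical layout: 4 text tokens, 6 reference-image tokens, 6 noise tokens.
segments = {"text": slice(0, 4), "ref": slice(4, 10), "noise": slice(10, 16)}
n_tokens, dim = 16, 8

rng = np.random.default_rng(0)
q = rng.normal(size=(n_tokens, dim))
k = rng.normal(size=(n_tokens, dim))
v = rng.normal(size=(n_tokens, dim))

# Sever the reference -> text edges: text tokens can no longer read the reference.
blocked = knockout_mask(n_tokens, segments, reader="text", source="ref")
out, weights = joint_attention(q, k, v, blocked)
```

With this mask, the text rows of `out` no longer depend on the reference tokens' keys or values in this layer, while the noise tokens can still read the reference directly through the image-to-image edges. Passing `reader="noise"` instead would block that direct route and leave the route through the text tokens open.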

Problem

Research questions and friction points this paper is trying to address.

vision-language binding

in-context image generation

multimodal DiT

cross-modal attention

reference image conditioning

Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language binding

in-context image generation

causal intervention