TF-TI2I: Training-Free Text-and-Image-to-Image Generation via Multi-Modal Implicit-Context Learning in Text-to-Image Models

📅 2025-03-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing text-and-image-to-image (TI2I) methods often use reference images only partially, focusing on specific elements such as objects or styles, and their generation quality degrades under complex, multi-image instructions. This paper proposes TF-TI2I, a training-free TI2I framework that conditions jointly on text and multiple reference images without any parameter updates. The approach has three core components: (1) implicit cross-modal context learning in the MM-DiT architecture, where textual tokens absorb visual information from vision tokens; (2) Reference Contextual Masking, which restricts contextual tokens to instruction-relevant visual information; and (3) a Winner-Takes-All module that prioritizes the most pertinent reference for each vision token, mitigating distribution shift. The paper also introduces FG-TI2I Bench, a benchmark tailored to TI2I evaluation and compatible with existing T2I methods, and reports robust performance across several benchmarks. The framework adapts existing T2I models such as SD3.

📝 Abstract

Text-and-Image-To-Image (TI2I), an extension of Text-To-Image (T2I), integrates image inputs with textual instructions to enhance image generation. Existing methods often use image inputs only partially, focusing on specific elements like objects or styles, or they suffer a decline in generation quality with complex, multi-image instructions. To overcome these challenges, we introduce Training-Free Text-and-Image-to-Image (TF-TI2I), which adapts cutting-edge T2I models such as SD3 without the need for additional training. Our method capitalizes on the MM-DiT architecture, in which, as we show, textual tokens can implicitly learn visual information from vision tokens. We enhance this interaction by extracting a condensed visual representation from reference images and sharing information selectively through Reference Contextual Masking, a technique that confines the usage of contextual tokens to instruction-relevant visual information. Additionally, our Winner-Takes-All module mitigates distribution shifts by prioritizing the most pertinent references for each vision token. Addressing the gap in TI2I evaluation, we also introduce the FG-TI2I Bench, a comprehensive benchmark tailored for TI2I and compatible with existing T2I methods. Our approach shows robust performance across various benchmarks, confirming its effectiveness in handling complex image-generation tasks.
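The abstract does not spell out the masking rule, so the following is only a rough sketch of how a mask of this kind could be built, not the paper's actual Reference Contextual Masking. It assumes a hypothetical token layout in which each reference image contributes a few condensed "context" tokens plus its raw "vision" tokens, and the image being generated contributes "gen" tokens. The assumed rule is that tokens attend freely within their own segment group, while generation tokens may read only the condensed context tokens of each reference. The function name, segment labels, and rule are all illustrative assumptions.

```python
def build_context_mask(segments):
    """Build a boolean attention mask (True = query may attend to key).

    segments: list of (owner, kind, length) tuples in sequence order.
    owner is e.g. "ref0", "ref1", or "gen"; kind is "context" or "vision".
    The rule below is an assumption for illustration only.
    """
    # Expand segments into one (owner, kind) label per token.
    labels = []
    for owner, kind, length in segments:
        labels.extend([(owner, kind)] * length)

    n = len(labels)
    mask = [[False] * n for _ in range(n)]
    for q, (q_owner, _q_kind) in enumerate(labels):
        for k, (k_owner, k_kind) in enumerate(labels):
            if q_owner == k_owner:
                # Tokens of the same image always see each other.
                mask[q][k] = True
            elif q_owner == "gen" and k_kind == "context":
                # Generation tokens read only condensed context tokens,
                # never a reference's raw vision tokens.
                mask[q][k] = True
    return mask


# Two references (2 context + 3 vision tokens each) and 4 generation tokens.
segments = [
    ("ref0", "context", 2), ("ref0", "vision", 3),
    ("ref1", "context", 2), ("ref1", "vision", 3),
    ("gen", "vision", 4),
]
mask = build_context_mask(segments)
```

In a real MM-DiT model a mask like this would be applied to the joint attention logits before the softmax; here it is kept as a plain nested list so the rule itself is easy to inspect.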

Problem

Research questions and friction points this paper is trying to address.

Existing methods use image inputs only partially, focusing on specific elements such as objects or styles.

Generation quality declines in complex, multi-image instruction scenarios.

TI2I lacks a dedicated evaluation benchmark.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free adaptation of T2I models

Reference Contextual Masking for selective sharing

Winner-Takes-All module prioritizes relevant references
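
As a minimal sketch of the Winner-Takes-All idea, assuming it means that each vision token attends to a single best-matching reference rather than blending all references, the following plain-Python example picks, for one query vector, the reference whose keys give the highest scaled dot-product score and then runs softmax attention over that reference alone. The selection criterion (maximum score per reference) and the function names are assumptions made for illustration; the paper may define the winner differently.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def winner_takes_all_attention(query, references):
    """Attend to only the single most relevant reference.

    query: list of floats (one vision token's query vector).
    references: list of (keys, values) pairs, one pair per reference image.
    Returns (winner_index, output_vector).
    """
    scale = 1.0 / math.sqrt(len(query))
    best_idx, best_score, best_scores = 0, float("-inf"), []
    for idx, (keys, _values) in enumerate(references):
        scores = [dot(query, key) * scale for key in keys]
        # Assumed criterion: a reference is ranked by its best-matching key.
        if max(scores) > best_score:
            best_idx, best_score, best_scores = idx, max(scores), scores

    _keys, values = references[best_idx]
    weights = softmax(best_scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return best_idx, output


# Toy example: the query aligns with the keys of the second reference.
query = [1.0, 0.0]
references = [
    ([[0.1, 0.9], [0.2, 0.5]], [[1.0, 0.0], [1.0, 0.0]]),
    ([[0.9, 0.1], [0.3, 0.3]], [[0.0, 1.0], [0.0, 1.0]]),
]
winner, output = winner_takes_all_attention(query, references)
```

Restricting each token to one reference keeps the attention distribution close to what the model saw in single-image training, which is one plausible reading of how the module "mitigates distribution shifts".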

🔎 Similar Papers

Unified Text-to-Image Generation and Retrieval