FreeGraftor: Training-Free Cross-Image Feature Grafting for Subject-Driven Text-to-Image Generation

📅 2025-04-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Subject-driven image generation faces a fundamental trade-off between identity fidelity and generation efficiency: fine-tuning methods incur high computational overhead, while zero-shot approaches suffer from poor subject consistency. This paper proposes a training-free cross-image feature grafting framework that enables precise transfer of reference subject details within the latent space of diffusion models. Our key contributions are: (1) an attention fusion mechanism guided by semantic matching and spatial constraints; (2) a geometry-aware noise initialization strategy to enhance robustness of cross-image feature alignment; and (3) native support for multi-subject collaborative generation. Without modifying the pre-trained diffusion model, our method significantly improves both identity preservation and text–image alignment. Quantitative and qualitative evaluations demonstrate superior overall performance compared to state-of-the-art zero-shot and training-free baselines. The implementation is publicly available.
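The summary's first contribution, attention fusion guided by semantic matching and spatial constraints, can be illustrated with a rough sketch. This is not the paper's implementation: the cosine-similarity matcher, the confidence threshold `tau`, and all function names are illustrative assumptions about how reference keys/values might be grafted into a diffusion model's self-attention.

```python
import numpy as np

def semantic_match(gen_feats, ref_feats):
    """Match each generated token to its most similar reference token.

    gen_feats: (N, d) token features from the generated image's attention layer.
    ref_feats: (M, d) token features from the reference image (assumed shapes).
    """
    gen = gen_feats / np.linalg.norm(gen_feats, axis=1, keepdims=True)
    ref = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    sim = gen @ ref.T                       # cosine similarity, (N, M)
    return sim.argmax(axis=1), sim.max(axis=1)

def grafted_attention(q, k_gen, v_gen, k_ref, v_ref, match_idx, match_score, tau=0.5):
    """Fuse reference keys/values into attention for confidently matched tokens.

    Tokens whose match score exceeds `tau` (an assumed spatial/confidence
    constraint) attend to reference features, grafting subject detail into
    the generated image; the rest attend to their own features as usual.
    """
    k, v = k_gen.copy(), v_gen.copy()
    mask = match_score > tau
    k[mask] = k_ref[match_idx[mask]]
    v[mask] = v_ref[match_idx[mask]]
    attn = q @ k.T / np.sqrt(q.shape[1])    # scaled dot-product scores
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True) # row-wise softmax
    return attn @ v
```

In a real diffusion pipeline this substitution would happen inside the denoiser's attention layers at each sampling step; here it is reduced to a single NumPy call to show the matching-then-fusion flow.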

📝 Abstract
Subject-driven image generation aims to synthesize novel scenes that faithfully preserve subject identity from reference images while adhering to textual guidance, yet existing methods struggle with a critical trade-off between fidelity and efficiency. Tuning-based approaches rely on time-consuming and resource-intensive subject-specific optimization, while zero-shot methods fail to maintain adequate subject consistency. In this work, we propose FreeGraftor, a training-free framework that addresses these limitations through cross-image feature grafting. Specifically, FreeGraftor employs semantic matching and position-constrained attention fusion to transfer visual details from reference subjects to the generated image. Additionally, our framework incorporates a novel noise initialization strategy to preserve geometry priors of reference subjects for robust feature matching. Extensive qualitative and quantitative experiments demonstrate that our method enables precise subject identity transfer while maintaining text-aligned scene synthesis. Without requiring model fine-tuning or additional training, FreeGraftor significantly outperforms existing zero-shot and training-free approaches in both subject fidelity and text alignment. Furthermore, our framework can seamlessly extend to multi-subject generation, making it practical for real-world deployment. Our code is available at https://github.com/Nihukat/FreeGraftor.
Problem

Research questions and friction points this paper is trying to address.

Balancing subject identity fidelity and generation efficiency
Eliminating need for time-consuming subject-specific model tuning
Enhancing zero-shot subject consistency without fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free cross-image feature grafting
Semantic matching and attention fusion
Noise initialization for geometry preservation
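The noise-initialization idea listed above can be sketched minimally: seed the initial latent with the reference subject's (inverted) latent in the subject region, so the subject's geometry survives into sampling and later feature matching is more robust. The function name, mask representation, and pasting strategy are assumptions for illustration, not the paper's actual procedure.

```python
import numpy as np

def init_latent_with_subject(ref_latent, subject_mask, seed=0):
    """Build an initial diffusion latent that preserves subject geometry.

    ref_latent:   latent of the reference image (e.g. from DDIM inversion;
                  inversion itself is assumed, not shown here).
    subject_mask: boolean array, True where the subject sits in the latent.
    Returns fresh Gaussian noise with the subject region copied from the
    reference latent, so its geometry prior is kept at step 0.
    """
    rng = np.random.default_rng(seed)
    latent = rng.standard_normal(ref_latent.shape)
    latent[subject_mask] = ref_latent[subject_mask]
    return latent
```

The background stays pure noise, so the text prompt still controls the scene, while the subject region starts from a geometry-consistent state.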