🤖 AI Summary
This work addresses a central challenge in personalized image generation: existing LoRA composition methods struggle to balance content fidelity with style consistency, often producing entangled content-style representations, weak controllability, and unstable fusion. To overcome these limitations, the authors propose a training-free decoupled fusion framework that separates content and style subspaces through rank-constrained fine-tuning. They introduce a prompt-guided multi-branch expert encoder to enable semantically controllable adapter aggregation, and a timestep-dependent classifier-free guidance mechanism to enhance generation stability. Together, these components achieve retraining-free disentangled LoRA fusion that preserves high-fidelity content while supporting flexible semantic control, significantly outperforming current state-of-the-art methods.
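The "semantically controllable adapter aggregation" above amounts to merging several LoRA deltas with weights derived from the prompt. A minimal sketch of that merge step follows; the function name, the fixed weights, and the flat adapter list are illustrative assumptions, and the paper's expert-encoder branches that produce the weights are not reproduced here:

```python
import numpy as np

def aggregate_lora_adapters(adapters, weights):
    """Hedged sketch of selective adapter aggregation.

    Each adapter is a LoRA pair (A, B) whose delta to the frozen base
    weight is B @ A; the merged delta is a weighted sum of the deltas.
    In the paper the weights would come from a prompt-guided expert
    encoder, not be hand-set as in this illustration.
    """
    return sum(w * (B @ A) for w, (A, B) in zip(weights, adapters))

# Usage: two rank-2 adapters for a 4x4 weight matrix, e.g. a "content"
# adapter weighted 0.7 and a "style" adapter weighted 0.3.
rng = np.random.default_rng(0)
adapters = [
    (rng.standard_normal((2, 4)), rng.standard_normal((4, 2)))
    for _ in range(2)
]
delta = aggregate_lora_adapters(adapters, [0.7, 0.3])
```

Because the merge is linear in the weights, setting one weight to zero cleanly removes that adapter's contribution, which is what makes selective (per-concept) control possible without retraining.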
📝 Abstract
Personalized image generation must balance content fidelity with stylistic consistency when synthesizing images from text and reference examples. Low-Rank Adaptation (LoRA) offers an efficient personalization approach, and combining LoRA weights trained on different concepts promises precise control. However, existing combination techniques face persistent challenges: entanglement between content and style representations, insufficient guidance for controlling each element's influence, and unstable weight fusion that often requires additional training. We address these limitations with CRAFT-LoRA, which has three complementary components: (1) rank-constrained backbone fine-tuning that injects low-rank projection residuals to encourage learning decoupled content and style subspaces; (2) a prompt-guided expert encoder with specialized branches that enables semantic extension and precise control through selective adapter aggregation; and (3) a training-free, timestep-dependent classifier-free guidance scheme that enhances generation stability by strategically adjusting noise predictions across diffusion steps. Our method significantly improves content-style disentanglement, enables flexible semantic control over LoRA module combinations, and achieves high-fidelity generation without additional retraining overhead.
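Component (3) modulates standard classifier-free guidance with a scale that depends on the diffusion timestep. The sketch below illustrates the general idea with a simple linear schedule; the function names, the linear decay, and the scale bounds are illustrative assumptions, not the paper's actual scheme:

```python
def timestep_cfg_scale(t, num_steps, w_max=7.5, w_min=3.0):
    """Hedged sketch: a guidance weight that decays over the denoising
    trajectory (schedule and bounds are illustrative, not from the paper).
    Early, high-noise steps get strong guidance to fix global structure;
    late steps get weaker guidance to preserve fine detail."""
    progress = t / max(num_steps - 1, 1)  # 1.0 at the noisiest step, 0.0 at the last
    return w_min + (w_max - w_min) * progress

def guided_noise(eps_uncond, eps_cond, t, num_steps):
    """Standard classifier-free guidance, with the scale above swapped in
    for the usual constant weight."""
    w = timestep_cfg_scale(t, num_steps)
    return eps_uncond + w * (eps_cond - eps_uncond)
```

In a real sampler, `eps_uncond` and `eps_cond` would be the model's noise predictions for the empty and full prompts at step `t`; varying `w` across steps is what the abstract calls "strategically adjusting noise predictions across diffusion steps."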