InstructMoLE: Instruction-Guided Mixture of Low-rank Experts for Multi-Conditional Image Generation

📅 2025-12-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
In multi-conditional image generation, single-adapter fine-tuning (e.g., LoRA) suffers from task interference, while existing token-level Mixture-of-LoRA-Experts (MoLE) architectures exhibit spatial fragmentation and semantic drift caused by misalignment between local token routing and global instruction semantics. To address this, we propose InstructMoLE, an instruction-guided Mixture of Low-Rank Experts framework. Our key contributions are: (1) an Instruction-Guided Routing (IGR) mechanism that uses the full user instruction as a global routing signal in place of per-token routing, and (2) an output-space orthogonality loss that explicitly enforces functional diversity across experts. Implemented atop diffusion Transformers (DiTs), the method enables efficient fine-tuning. Experiments on multi-conditional generation benchmarks demonstrate significant improvements over LoRA and all evaluated MoLE variants, achieving superior generation consistency, structural integrity, and instruction fidelity.

📝 Abstract
Parameter-Efficient Fine-Tuning of Diffusion Transformers (DiTs) for diverse, multi-conditional tasks often suffers from task interference when using monolithic adapters like LoRA. The Mixture of Low-rank Experts (MoLE) architecture offers a modular solution, but its potential is usually limited by routing policies that operate at a token level. Such local routing can conflict with the global nature of user instructions, leading to artifacts like spatial fragmentation and semantic drift in complex image generation tasks. To address these limitations, we introduce InstructMoLE, a novel framework that employs an Instruction-Guided Mixture of Low-Rank Experts. Instead of per-token routing, InstructMoLE utilizes a global routing signal, Instruction-Guided Routing (IGR), derived from the user's comprehensive instruction. This ensures that a single, coherently chosen expert council is applied uniformly across all input tokens, preserving the global semantics and structural integrity of the generation process. To complement this, we introduce an output-space orthogonality loss, which promotes expert functional diversity and mitigates representational collapse. Extensive experiments demonstrate that InstructMoLE significantly outperforms existing LoRA adapters and MoLE variants across challenging multi-conditional generation benchmarks. Our work presents a robust and generalizable framework for instruction-driven fine-tuning of generative models, enabling superior compositional control and fidelity to user intent.
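The routing idea in the abstract can be sketched in code. The following is a minimal illustrative sketch, not the paper's implementation: it assumes a pooled instruction embedding, a plain linear router, and LoRA-style down/up projections per expert (all function and array names are hypothetical). The property it demonstrates is the one the abstract emphasizes: the sparse gate vector is computed from the instruction alone and shared by every token, so no two spatial positions can be routed to different experts.

```python
import numpy as np

def instruction_guided_mole(tokens, instruction_emb, router_w, down, up, top_k=2):
    """Hypothetical sketch of Instruction-Guided Routing (IGR).
    tokens: (T, D) token features; instruction_emb: (D,) pooled instruction;
    router_w: (D, E) router weights; down: (E, D, R), up: (E, R, D) per-expert
    LoRA factors. Shapes and names are illustrative, not from the paper."""
    # Route on the global instruction embedding, NOT on individual tokens.
    logits = instruction_emb @ router_w              # (E,) one score per expert
    top = np.argsort(logits)[-top_k:]                # indices of the top-k experts
    gates = np.zeros_like(logits)
    e = np.exp(logits[top] - logits[top].max())
    gates[top] = e / e.sum()                         # sparse softmax over top-k only
    # The same gate vector is applied uniformly to every token, so the chosen
    # "expert council" is coherent across the whole image.
    delta = sum(g * (tokens @ down[i] @ up[i]) for i, g in enumerate(gates) if g > 0)
    return tokens + delta
```

As in standard LoRA practice, initializing `up` to zeros makes the adapted model start out identical to the base model, with the experts only gradually specializing during fine-tuning.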
Problem

Research questions and friction points this paper is trying to address.

Monolithic adapters cause task interference in multi-conditional image generation
Token-level routing conflicts with global instructions causing semantic drift
Existing methods lack coherent expert selection for preserving structural integrity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses instruction-guided routing for global expert selection
Applies output-space orthogonality loss for expert diversity
Enables coherent multi-conditional image generation via unified expert council
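The output-space orthogonality loss mentioned above can be sketched as a penalty on pairwise similarity between expert outputs. This is a hedged illustration under the assumption that each expert's output is flattened into a row vector and compared by cosine similarity; the function name and exact formulation are ours, not the paper's.

```python
import numpy as np

def output_orthogonality_loss(expert_outputs):
    """Illustrative output-space orthogonality penalty (assumed formulation).
    expert_outputs: (E, N) matrix, one flattened output per expert.
    Returns 0 when all expert outputs are mutually orthogonal."""
    norms = np.linalg.norm(expert_outputs, axis=1, keepdims=True) + 1e-8
    u = expert_outputs / norms            # unit-normalize each expert's output
    gram = u @ u.T                        # (E, E) cosine-similarity Gram matrix
    off_diag = gram - np.eye(len(gram))   # identity target: orthogonal experts
    return float((off_diag ** 2).sum())
```

Driving this penalty toward zero pushes the Gram matrix toward the identity, which discourages the representational collapse described in the abstract, where multiple experts would learn redundant functions.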