Semantic Routing: Exploring Multi-Layer LLM Feature Weighting for Diffusion Transformers

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing DiT-based text-to-image models: they rely solely on single-layer, static features from large language models (LLMs) as text encoders, which fails to capture the dynamic semantic requirements that vary across timesteps and network depths in the diffusion process. To overcome this, the authors propose a unified normalized convex fusion framework that uses a lightweight gating mechanism to dynamically weight and fuse multi-layer LLM hidden states along temporal, depth-wise, and joint dimensions. The study presents the first systematic validation of depth-aware semantic routing and shows that purely temporal fusion suffers from train-inference trajectory misalignment, which degrades performance. The proposed method substantially improves text-image alignment and compositional generation, achieving a 9.97-point gain on the GenAI-Bench Counting task and establishing a strong new baseline.

📝 Abstract
Recent DiT-based text-to-image models increasingly adopt LLMs as text encoders, yet text conditioning remains largely static and often utilizes only a single LLM layer, despite pronounced semantic hierarchy across LLM layers and non-stationary denoising dynamics over both diffusion time and network depth. To better match the dynamic process of DiT generation and thereby enhance the diffusion model's generative capability, we introduce a unified normalized convex fusion framework equipped with lightweight gates to systematically organize multi-layer LLM hidden states via time-wise, depth-wise, and joint fusion. Experiments establish Depth-wise Semantic Routing as the superior conditioning strategy, consistently improving text-image alignment and compositional generation (e.g., +9.97 on the GenAI-Bench Counting task). Conversely, we find that purely time-wise fusion can paradoxically degrade visual generation fidelity. We attribute this to a train-inference trajectory mismatch: under classifier-free guidance, nominal timesteps fail to track the effective SNR, causing semantically mistimed feature injection during inference. Overall, our results position depth-wise routing as a strong and effective baseline and highlight the critical need for trajectory-aware signals to enable robust time-dependent conditioning.
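The core mechanism described above — a normalized convex fusion of multi-layer LLM hidden states via lightweight gates — can be illustrated with a minimal sketch. This is not the paper's implementation: the gate logits here are fixed arrays standing in for the learned lightweight gate networks, and all names (`depthwise_route`, `gate_logits`, the toy shapes) are illustrative. The key property it demonstrates is that a softmax over layers yields convex weights, so the fused conditioning signal per DiT block stays inside the convex hull of the per-layer features:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax: weights are non-negative and sum to 1
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def depthwise_route(layer_states, gate_logits):
    """Normalized convex fusion of multi-layer LLM hidden states.

    layer_states: (num_layers, seq_len, dim) stacked LLM hidden states
    gate_logits:  (num_layers,) logits from a (here: stand-in) gate
    Returns the fused (seq_len, dim) conditioning for one DiT block.
    """
    w = softmax(gate_logits)                     # convex weights over layers
    return np.tensordot(w, layer_states, axes=1)  # weighted sum over layer axis

# Toy example (illustrative shapes): 4 LLM layers, 3 tokens, dim 8.
rng = np.random.default_rng(0)
states = rng.normal(size=(4, 3, 8))

# Depth-wise routing: one gate per DiT block, so each block can draw on a
# different mixture of shallow vs. deep LLM semantics.
num_dit_blocks = 2
gates = rng.normal(size=(num_dit_blocks, 4))     # stand-in for learned logits
fused_per_block = [depthwise_route(states, gates[b]) for b in range(num_dit_blocks)]
```

Time-wise and joint fusion follow the same pattern, with the gate additionally conditioned on a timestep embedding; the abstract's caveat is that such time-dependent gates need trajectory-aware signals, since under classifier-free guidance the nominal timestep can drift from the effective SNR.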
Problem

Research questions and friction points this paper is trying to address.

Semantic Routing
Diffusion Transformers
LLM Feature Weighting
Text-to-Image Generation
Dynamic Conditioning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic Routing
Diffusion Transformers
Multi-layer LLM Feature Weighting
Depth-wise Fusion
Trajectory-aware Conditioning