CTCal: Rethinking Text-to-Image Diffusion Models via Cross-Timestep Self-Calibration

πŸ“… 2026-03-21
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing text-to-image diffusion models are constrained by implicit supervision, making fine-grained text-image alignment challenging. This work proposes a Cross-Timestep Calibration mechanism (CTCal) that leverages reliable cross-attention alignment signals from low-noise, small timesteps to explicitly guide representation learning at high-noise, large timesteps. CTCal integrates these calibration signals with the original diffusion loss through a timestep-aware adaptive weighting strategy. Notably, this approach introduces explicit cross-timestep alignment transfer for the first time, overcoming the limitations of conventional training paradigms that rely solely on implicit supervision. The method is model-agnostic and compatible with both diffusion and flow-based generative models. Extensive evaluations on T2I-CompBench++ and GenEval demonstrate substantial improvements in alignment quality, confirming CTCal’s effectiveness and generalization across mainstream architectures such as Stable Diffusion 2.1 and SD 3.

πŸ“ Abstract
Recent advancements in text-to-image synthesis have been largely propelled by diffusion-based models, yet achieving precise alignment between text prompts and generated images remains a persistent challenge. We find that this difficulty arises primarily from the limitations of the conventional diffusion loss, which provides only implicit supervision for modeling fine-grained text-image correspondence. In this paper, we introduce Cross-Timestep Self-Calibration (CTCal), founded on the observation that establishing accurate text-image alignment within diffusion models becomes progressively more difficult as the timestep increases. CTCal leverages the reliable text-image alignment (i.e., cross-attention maps) formed at smaller, less noisy timesteps to calibrate representation learning at larger, noisier timesteps, thereby providing explicit supervision during training. We further propose a timestep-aware adaptive weighting strategy to achieve a harmonious integration of CTCal and the diffusion loss. CTCal is model-agnostic and can be seamlessly integrated into existing text-to-image diffusion models, encompassing both diffusion-based (e.g., SD 2.1) and flow-based approaches (e.g., SD 3). Extensive experiments on the T2I-CompBench++ and GenEval benchmarks demonstrate the effectiveness and generalizability of the proposed CTCal. Our code is available at https://github.com/xiefan-guo/ctcal.
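To make the mechanism concrete, here is a minimal NumPy sketch of the idea described in the abstract: a cross-attention map from a small (low-noise) timestep serves as an explicit calibration target for the map at a large (high-noise) timestep, and the calibration term is blended with the ordinary diffusion loss via a timestep-aware weight. All function names, the power-law weight, and the mean-squared-error form are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

def timestep_weight(t, t_max=1000, gamma=2.0):
    # Hypothetical timestep-aware adaptive weight: the paper argues that
    # alignment is hardest at large (noisy) timesteps, so the calibration
    # term is assumed to matter more as t grows. Shape and gamma are guesses.
    return (t / t_max) ** gamma

def ctcal_loss(attn_small_t, attn_large_t, diffusion_loss, t, t_max=1000):
    # attn_*: cross-attention maps of shape (text_tokens, H, W).
    # The map from the small, low-noise timestep is treated as a fixed
    # (stop-gradient) target that calibrates the noisy-timestep map.
    target = attn_small_t
    calib = np.mean((attn_large_t - target) ** 2)  # assumed MSE calibration
    w = timestep_weight(t, t_max)
    # Total training objective: implicit diffusion supervision plus the
    # explicit, timestep-weighted cross-timestep calibration term.
    return diffusion_loss + w * calib
```

In a real training loop the two attention maps would come from two forward passes of the denoiser at different noise levels for the same prompt; this sketch only shows how the calibration signal and the adaptive weight could combine into one scalar loss.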
Problem

Research questions and friction points this paper is trying to address.

text-to-image alignment
diffusion models
cross-attention
timestep
image synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Timestep Self-Calibration
text-to-image alignment
diffusion models
explicit supervision
timestep-aware weighting