🤖 AI Summary
A standardized evaluation protocol for continual post-training of text-to-image diffusion models is currently lacking, hindering systematic progress in this direction. Method: We introduce T2I-ConBench, the first dedicated benchmark for this setting, covering two practical scenarios (item customization and domain enhancement) and evaluating methods along four dimensions: retention of generality, target-task performance, catastrophic forgetting, and cross-task generalization. The framework pairs automated metrics with human-preference modeling and vision-language question answering (VL-QA) to overcome the limitations of purely automated scoring, and it supports multi-stage task sequences. Contribution/Results: We systematically assess ten state-of-the-art methods across three realistic task sequences, finding that no approach excels on all fronts: even "oracle" joint training fails to balance every metric, and cross-task generalization is particularly weak. We open-source the full dataset, codebase, and evaluation toolchain, establishing a standard protocol and an actionable roadmap for continual text-to-image learning.
📝 Abstract
Continual post-training adapts a single text-to-image diffusion model to learn new tasks without incurring the cost of separate models, but naive post-training causes forgetting of pretrained knowledge and undermines zero-shot compositionality. We observe that the absence of a standardized evaluation protocol hampers research on continual post-training. To address this, we introduce T2I-ConBench, a unified benchmark for continual post-training of text-to-image models. T2I-ConBench focuses on two practical scenarios, item customization and domain enhancement, and analyzes four dimensions: (1) retention of generality, (2) target-task performance, (3) catastrophic forgetting, and (4) cross-task generalization. It combines automated metrics, human-preference modeling, and vision-language QA for comprehensive assessment. We benchmark ten representative methods across three realistic task sequences and find that no approach excels on all fronts. Even joint "oracle" training does not succeed for every task, and cross-task generalization remains unsolved. We release all datasets, code, and evaluation tools to accelerate research in continual post-training for text-to-image models.