🤖 AI Summary
A standardized evaluation protocol for continual post-training of text-to-image diffusion models is currently lacking, hindering systematic progress in this direction. Method: We introduce T2I-ConBench, the first dedicated benchmark for this setting, covering two practical scenarios (item customization and domain enhancement) and evaluating methods along four dimensions: retention of generality, target-task performance, catastrophic forgetting, and cross-task generalization. The framework pairs automated metrics with human-preference modeling and vision-language question answering (VL-QA) to overcome the limitations of purely automated scoring, and it supports multi-stage task sequences. Contribution/Results: We systematically assess ten state-of-the-art methods across three realistic task sequences, finding that no approach excels on all fronts: even "oracle" joint training fails to balance every metric, and cross-task generalization is particularly weak. We open-source the full dataset, codebase, and evaluation toolchain, establishing a standard protocol and an actionable roadmap for continual text-to-image learning.
📝 Abstract
Continual post-training adapts a single text-to-image diffusion model to learn new tasks without incurring the cost of separate models, but naive post-training causes forgetting of pretrained knowledge and undermines zero-shot compositionality. We observe that the absence of a standardized evaluation protocol hampers research on continual post-training. To address this, we introduce T2I-ConBench, a unified benchmark for continual post-training of text-to-image models. T2I-ConBench focuses on two practical scenarios, item customization and domain enhancement, and analyzes four dimensions: (1) retention of generality, (2) target-task performance, (3) catastrophic forgetting, and (4) cross-task generalization. It combines automated metrics, human-preference modeling, and vision-language QA for comprehensive assessment. We benchmark ten representative methods across three realistic task sequences and find that no approach excels on all fronts. Even joint "oracle" training does not succeed for every task, and cross-task generalization remains unsolved. We release all datasets, code, and evaluation tools to accelerate research in continual post-training for text-to-image models.