🤖 AI Summary
This work addresses the limitations of current text-to-image (T2I) model evaluation, which relies on static prompt sets and is therefore prone to overfitting and benchmark contamination. The authors propose DynT2I-Eval, a fully automated dynamic evaluation framework that constructs a structured visual semantic space and decomposes prompts into controllable dimensions such as subject, logical constraint, environment, and composition. By combining task-specific subspaces with difficulty-aware sampling, the framework continuously generates fresh prompts. It further unifies heterogeneous model outputs into prompt-conditioned pairwise comparisons and uses a dynamic scheduler, micro-batch aggregation, and weighted Bayesian updates to keep an online leaderboard stable as prompt distributions shift and new models are added. Experiments with independently sampled prompt streams indicate that refreshed prompts reduce the impact of prompt-set-specific tuning, and simulations and ablations show a strong balance among cold-start convergence, late-entry discovery, and long-run ranking fidelity.
📝 Abstract
Existing text-to-image (T2I) benchmarks largely rely on fixed prompt sets, leaving them vulnerable to overfitting and benchmark contamination once publicly released and repeatedly reused. In this work, we propose DynT2I-Eval, a fully automated dynamic evaluation framework for T2I models. It constructs a structured visual semantic space from long-form descriptions, decomposing prompts into controllable dimensions (e.g., subject, logical constraint, environment, and composition). This enables the continuous generation of fresh prompts via task-specific spaces and difficulty-aware sampling. DynT2I-Eval evaluates model performance across text alignment, perceptual quality, and aesthetics. Heterogeneous outputs are unified into prompt-conditioned pairwise comparisons, allowing a dynamic scheduler, micro-batch aggregation, and weighted Bayesian updates to maintain a stable online leaderboard despite changing prompt distributions and model injection. Experiments with independently sampled prompt streams demonstrate that continually refreshed prompts provide a robust evaluation protocol, reducing the impact of prompt-set-specific tuning. Simulations and ablations further confirm that the proposed ranking framework achieves a strong balance among cold-start convergence, late-entry discovery, and long-run ranking fidelity.
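The abstract does not specify the form of the weighted Bayesian update, so the following is only a minimal sketch of one way micro-batch aggregation and a weighted update over pairwise comparisons could fit together. It assumes a Beta posterior over each model's win rate against the field, a uniform Beta(1, 1) prior, and a per-comparison weight in (0, 1]. The names `BetaPosterior`, `aggregate_micro_batch`, and `update_leaderboard` are hypothetical and are not taken from DynT2I-Eval.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class BetaPosterior:
    """Beta(alpha, beta) belief over a model's win rate against the field."""
    alpha: float = 1.0  # uniform prior; an assumption, not from the paper
    beta: float = 1.0

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)


def aggregate_micro_batch(comparisons):
    """Sum weighted wins and losses per model over one micro-batch.

    Each comparison is a (winner, loser, weight) tuple. The weight is a
    hypothetical confidence or difficulty score in (0, 1].
    """
    wins = defaultdict(float)
    losses = defaultdict(float)
    for winner, loser, weight in comparisons:
        wins[winner] += weight
        losses[loser] += weight
    return wins, losses


def update_leaderboard(posteriors, comparisons):
    """Apply one fractional-count Beta update per model and return the ranking.

    Models not seen before start from the prior, so a late entrant can be
    added simply by appearing in a comparison.
    """
    wins, losses = aggregate_micro_batch(comparisons)
    for model in set(wins) | set(losses):
        post = posteriors.setdefault(model, BetaPosterior())
        post.alpha += wins[model]
        post.beta += losses[model]
    # Rank by posterior mean win rate, best first.
    return sorted(posteriors, key=lambda m: posteriors[m].mean, reverse=True)


# Example micro-batch with made-up model names and weights.
posteriors = {}
batch = [
    ("model_a", "model_b", 1.0),
    ("model_a", "model_c", 0.5),
    ("model_b", "model_c", 1.0),
]
ranking = update_leaderboard(posteriors, batch)
```

Aggregating a micro-batch before updating means each model's posterior changes once per batch rather than once per comparison, which is one plausible way to damp noise from individual judgments. A fuller system would likely model pairwise strength directly (for example with a Bradley-Terry style likelihood) instead of a per-model win rate.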