Infinity and Beyond: Compositional Alignment in VAR and Diffusion T2I Models

📅 2025-12-12

🤖 AI Summary

Text-to-image (T2I) models still struggle with compositional alignment, particularly object-attribute binding, spatial relations, numeracy, and multi-object generation. This paper benchmarks six representative models (SDXL, PixArt-α, Flux-Dev, Flux-Schnell, Infinity-2B, and Infinity-8B) on the full T2I-CompBench++ and GenEval suites. Infinity-8B, a Visual Autoregressive (VAR) model, achieves the strongest overall compositional alignment, and the smaller Infinity-2B matches or exceeds larger diffusion models in several categories, pointing to a favorable efficiency-performance trade-off. SDXL and PixArt-α show persistent weaknesses on attribute-sensitive and spatial tasks. The authors present this as the first systematic comparison of VAR and diffusion approaches to compositional alignment, with unified baselines for future work.

📝 Abstract

Achieving compositional alignment between textual descriptions and generated images (covering objects, attributes, and spatial relationships) remains a core challenge for modern text-to-image (T2I) models. Although diffusion-based architectures have been widely studied, the compositional behavior of emerging Visual Autoregressive (VAR) models is still largely unexamined. We benchmark six diverse T2I systems (SDXL, PixArt-$α$, Flux-Dev, Flux-Schnell, Infinity-2B, and Infinity-8B) across the full T2I-CompBench++ and GenEval suites, evaluating alignment in color and attribute binding, spatial relations, numeracy, and complex multi-object prompts. Across both benchmarks, Infinity-8B achieves the strongest overall compositional alignment, while Infinity-2B matches or exceeds larger diffusion models in several categories, highlighting favorable efficiency-performance trade-offs. In contrast, SDXL and PixArt-$α$ show persistent weaknesses in attribute-sensitive and spatial tasks. These results provide the first systematic comparison of VAR and diffusion approaches to compositional alignment and establish unified baselines for the future development of T2I models.
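To make the comparison described above concrete, the following is a minimal sketch of how per-category benchmark scores might be turned into an overall ranking and a per-category winner list. Everything in it is an assumption for illustration: the model names, the category names, the score values (placeholders, not results from the paper), and the use of an unweighted mean as the overall score. The abstract does not state how the authors aggregate categories.

```python
from statistics import mean

# Placeholder scores in [0, 1]. These are NOT results from the paper;
# model and category names are hypothetical stand-ins.
scores = {
    "model_a": {"color": 0.50, "spatial": 0.25, "numeracy": 0.75},
    "model_b": {"color": 0.75, "spatial": 0.50, "numeracy": 0.50},
}


def overall(per_category):
    """Unweighted mean over categories (an assumed aggregation rule)."""
    return mean(list(per_category.values()))


def rank_models(all_scores):
    """Return model names sorted from highest to lowest overall score."""
    return sorted(all_scores, key=lambda m: overall(all_scores[m]), reverse=True)


def category_winners(all_scores):
    """Map each category to the model with the highest score in it."""
    categories = next(iter(all_scores.values())).keys()
    return {c: max(all_scores, key=lambda m: all_scores[m][c]) for c in categories}


if __name__ == "__main__":
    print(rank_models(scores))
    print(category_winners(scores))
```

A breakdown like `category_winners` is what supports a statement such as "matches or exceeds larger models in several categories": a model can win individual categories without having the best overall mean.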

Problem

Research questions and friction points this paper is trying to address.

Evaluating compositional alignment in text-to-image models

Comparing VAR and diffusion models across diverse benchmarks

Identifying weaknesses in attribute and spatial task performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarking six T2I models on compositional alignment tasks

Evaluating VAR and diffusion models across T2I-CompBench++ and GenEval

Establishing unified baselines for future T2I model development
