🤖 AI Summary
The compositional generalization capability of visual generative models, i.e., their ability to synthesize novel combinations of known concepts, remains poorly understood. This paper systematically investigates the architectural and training factors that affect compositional generalization in image and video generation, and identifies two core drivers: (1) whether the training objective is discrete or continuous, and (2) how completely the conditioning inputs specify the constituent concepts during training. Building on these findings, we propose a hybrid optimization strategy within the MaskGIT framework: the primary discrete reconstruction objective is augmented with an auxiliary continuous target derived from the Joint-Embedding Predictive Architecture (JEPA), which relaxes the discrete loss. We conduct controlled ablation studies to quantitatively evaluate compositional generalization. Experiments demonstrate substantial improvements in compositional generalization for discrete generative models on complex scenes. To our knowledge, this is the first work to empirically validate the effectiveness and generality of jointly optimizing discrete and continuous objectives for structured semantic synthesis.
📝 Abstract
Compositional generalization, the ability to generate novel combinations of known concepts, is a key ingredient for visual generative models. Yet the mechanisms that enable or inhibit it are not all fully understood. In this work, we conduct a systematic study of how various design choices influence compositional generalization in image and video generation, whether positively or negatively. Through controlled experiments, we identify two key factors: (i) whether the training objective operates on a discrete or continuous distribution, and (ii) to what extent conditioning provides information about the constituent concepts during training. Building on these insights, we show that, in discrete models such as MaskGIT, relaxing the discrete loss with an auxiliary continuous JEPA-based objective can improve compositional performance.
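To make the idea of combining a discrete loss with an auxiliary continuous objective more concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the discrete term is a cross-entropy over masked token positions (as in MaskGIT-style training) and that the continuous term is a mean squared error between predicted and target embeddings at the same positions, standing in for a JEPA-style target. The function names, the choice of MSE, and the weight `aux_weight` are all hypothetical; the abstract does not specify them.

```python
import math

def masked_cross_entropy(logits, targets, mask):
    """Mean cross-entropy over masked positions (assumed discrete token loss)."""
    total, count = 0.0, 0
    for row, target, is_masked in zip(logits, targets, mask):
        if not is_masked:
            continue  # only masked tokens contribute to the loss
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_norm - row[target]
        count += 1
    return total / max(count, 1)

def masked_embedding_mse(pred, target, mask):
    """Mean squared error between embeddings at masked positions
    (assumed stand-in for a JEPA-style continuous target)."""
    total, count = 0.0, 0
    for p, t, is_masked in zip(pred, target, mask):
        if not is_masked:
            continue
        total += sum((a - b) ** 2 for a, b in zip(p, t)) / len(p)
        count += 1
    return total / max(count, 1)

def hybrid_loss(logits, targets, pred_emb, target_emb, mask, aux_weight=0.5):
    """Discrete loss plus a weighted continuous auxiliary term.
    aux_weight is a hypothetical hyperparameter, not taken from the paper."""
    discrete = masked_cross_entropy(logits, targets, mask)
    continuous = masked_embedding_mse(pred_emb, target_emb, mask)
    return discrete + aux_weight * continuous, discrete, continuous

# Toy example: three token positions, a two-entry codebook, 2-d embeddings.
logits = [[2.0, 0.0], [0.0, 0.0], [0.0, 3.0]]
targets = [0, 1, 1]
mask = [True, True, False]  # the last position is visible, so it is ignored
pred_emb = [[1.0, 0.0], [0.0, 0.0], [5.0, 5.0]]
target_emb = [[1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]

total, discrete, continuous = hybrid_loss(logits, targets, pred_emb, target_emb, mask)
print(f"total={total:.4f} discrete={discrete:.4f} continuous={continuous:.4f}")
```

In this sketch the continuous term gives a graded training signal even when the predicted token is wrong, which is one way to read "relaxing" the discrete loss. In a JEPA-style setup the target embeddings would typically come from a separate target encoder that receives no gradient, a detail that is omitted here.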