PolyGen: Fully Synthetic Vision-Language Training via Multi-Generator Ensembles

📅 2026-02-01
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing synthetic vision–language pretraining methods rely on a single generator, which often introduces model-specific biases and limits feature diversity. To address this, this work proposes PolyGen, a framework that leverages an ensemble of architecturally diverse generators to construct high-quality synthetic data on their intersecting manifold, effectively mitigating generation artifacts. The method further incorporates a procedurally generated hard-negative curriculum and a synthetic data redistribution strategy to strengthen fine-grained syntactic and compositional understanding. By prioritizing structural diversity over sheer data volume, PolyGen significantly improves data efficiency, outperforming the SynthCLIP baseline by 19.0% on a comprehensive multi-task benchmark and achieving a 9.1% gain on the SugarCrepe++ compositional reasoning benchmark.
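To make the hard-negative idea concrete, here is a minimal sketch of one procedural perturbation rule (not the paper's actual pipeline; the function name and the attribute-swap rule are illustrative assumptions). It turns a caption into a lexically similar but compositionally wrong negative by swapping two attributes:

```python
# Hypothetical sketch of one programmatic hard-negative rule:
# swapping two attributes yields a caption that is lexically close
# to the original but describes a different composition.
def swap_hard_negative(caption: str, a: str, b: str) -> str:
    """Swap all mentions of `a` and `b` in `caption`."""
    placeholder = "\x00"  # sentinel that never appears in normal text
    return caption.replace(a, placeholder).replace(b, a).replace(placeholder, b)

print(swap_hard_negative("a red cube on a blue sphere", "red", "blue"))
# -> "a blue cube on a red sphere"
```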

๐Ÿ“ Abstract
Synthetic data offers a scalable solution for vision-language pre-training, yet current state-of-the-art methods typically rely on scaling up a single generative backbone, which introduces generator-specific spectral biases and limits feature diversity. In this work, we introduce PolyGen, a framework that redefines synthetic data construction by prioritizing manifold coverage and compositional rigor over simple dataset size. PolyGen employs a Polylithic approach that trains on the intersection of architecturally distinct generators, effectively marginalizing out model-specific artifacts. Additionally, we introduce a Programmatic Hard Negative curriculum that enforces fine-grained syntactic understanding. By structurally reallocating the same data budget from unique captions to multi-source variations, PolyGen achieves a more robust feature space, outperforming the leading single-source baseline (SynthCLIP) by +19.0% on aggregate multi-task benchmarks and by +9.1% on the SugarCrepe++ compositionality benchmark. These results demonstrate that structural diversity is a more data-efficient scaling axis than simply increasing the volume of single-source samples.
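As a rough illustration of the budget-reallocation idea, the sketch below (hypothetical names; the real PolyGen pipeline additionally filters samples toward the generators' intersecting manifold, which is omitted here) spends a fixed image budget on multi-source renderings of fewer captions rather than on more unique captions:

```python
from typing import Callable, List, Tuple

def build_polylithic_pairs(
    captions: List[str],
    generators: List[Callable[[str], bytes]],  # each maps a caption to image bytes
    budget: int,
) -> List[Tuple[bytes, str]]:
    """Render each kept caption once per generator, so the same image budget
    buys multi-source variations instead of additional unique captions."""
    k = len(generators)
    pairs: List[Tuple[bytes, str]] = []
    for caption in captions[: budget // k]:
        for generate in generators:
            pairs.append((generate(caption), caption))
    return pairs  # len(pairs) <= budget
```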
Problem

Research questions and friction points this paper is trying to address.

synthetic data
vision-language pre-training
generator bias
feature diversity
compositional understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

PolyGen
multi-generator ensembles
synthetic data
programmatic hard negatives
compositional generalization