🤖 AI Summary
This study investigates whether any-to-any multimodal generative models exhibit superior cross-modal consistency compared to specialized models—and whether such consistency is objectively grounded or merely subjectively perceived. To this end, we introduce ACON, the first benchmark explicitly designed for evaluating cross-modal consistency, covering images, captions, editing instructions, and Q&A pairs. We further propose three principled evaluation criteria: cyclic consistency, forward equivariance, and conjugated equivariance. Experimental results show that any-to-any models do not significantly outperform specialized models on pointwise evaluations (e.g., cyclic consistency); however, they demonstrate weak yet statistically significant cross-modal equivariance within structured latent spaces, particularly in chained editing scenarios. Our work delineates the practical boundaries of unified architectures and establishes a novel, equivariance-centered paradigm for assessing multimodal consistency.
📝 Abstract
Any-to-any generative models aim to enable seamless interpretation and generation across multiple modalities within a unified framework, yet their ability to preserve relationships across modalities remains uncertain. Do unified models truly achieve cross-modal coherence, or is this coherence merely perceived? To explore this, we introduce ACON, a dataset of 1,000 images (500 newly contributed) paired with captions, editing instructions, and Q&A pairs to evaluate cross-modal transfers rigorously. Using three consistency criteria (cyclic consistency, forward equivariance, and conjugated equivariance), our experiments reveal that any-to-any models do not consistently demonstrate greater cross-modal consistency than specialized models in pointwise evaluations such as cyclic consistency. However, equivariance evaluations uncover weak but observable consistency through structured analyses of the intermediate latent space enabled by multiple editing operations. We release our code and data at https://github.com/JiwanChung/ACON.
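To make the cyclic consistency criterion concrete, here is a minimal, hedged sketch: an "image" is sent through an image-to-caption-to-image round trip, and the reconstruction is compared to the original. The `image_to_caption` and `caption_to_image` functions below are toy stand-ins (a brightness-only captioner and generator), not ACON's actual models or API; they exist only to show why a lossy cross-modal bottleneck lowers the score.

```python
# Toy stand-ins for an any-to-any model's modality transfers (assumptions,
# not the ACON implementation): images are flat lists of pixel intensities.

def image_to_caption(image):
    # Toy captioner: describe the image only by its mean brightness.
    mean = sum(image) / len(image)
    return f"brightness:{round(mean, 1)}"

def caption_to_image(caption, size):
    # Toy generator: render a flat image at the captioned brightness.
    level = float(caption.split(":")[1])
    return [level] * size

def cyclic_consistency(image):
    """Score the image -> caption -> image round trip in [0, 1].

    1.0 means the transfer chain perfectly reconstructs the input.
    """
    reconstructed = caption_to_image(image_to_caption(image), len(image))
    error = sum(abs(a - b) for a, b in zip(image, reconstructed)) / len(image)
    return 1.0 - error

flat = [0.5, 0.5, 0.5, 0.5]       # fully described by its caption
textured = [0.1, 0.9, 0.1, 0.9]   # texture is lost in the caption bottleneck
print(round(cyclic_consistency(flat), 2))      # perfect round trip: 1.0
print(round(cyclic_consistency(textured), 2))  # lossy round trip: 0.6
```

The same pointwise recipe applies to any modality pair; the paper's equivariance criteria instead compare how edits commute with these transfers rather than scoring a single round trip.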