Are Any-to-Any Models More Consistent Across Modality Transfers Than Specialists?

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether any-to-any multimodal generative models exhibit greater cross-modal consistency than specialized models, and whether such consistency is objectively grounded or merely subjectively perceived. To this end, the authors introduce ACON, a benchmark explicitly designed for evaluating cross-modal consistency, covering images, captions, editing instructions, and Q&A pairs. They further propose three principled evaluation criteria: cyclic consistency, forward equivariance, and conjugated equivariance. Experimental results show that any-to-any models do not significantly outperform specialized models on pointwise tasks such as cyclic reconstruction; however, they demonstrate weak but observable cross-modal equivariance within structured latent spaces, particularly in chained editing scenarios. The work delineates the practical boundaries of unified architectures and establishes an equivariance-centered paradigm for assessing multimodal consistency.

📝 Abstract
Any-to-any generative models aim to enable seamless interpretation and generation across multiple modalities within a unified framework, yet their ability to preserve relationships across modalities remains uncertain. Do unified models truly achieve cross-modal coherence, or is this coherence merely perceived? To explore this, we introduce ACON, a dataset of 1,000 images (500 newly contributed) paired with captions, editing instructions, and Q&A pairs to evaluate cross-modal transfers rigorously. Using three consistency criteria (cyclic consistency, forward equivariance, and conjugated equivariance), our experiments reveal that any-to-any models do not consistently demonstrate greater cross-modal consistency than specialized models in pointwise evaluations such as cyclic consistency. However, equivariance evaluations uncover weak but observable consistency through structured analyses of the intermediate latent space enabled by multiple editing operations. We release our code and data at https://github.com/JiwanChung/ACON.
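The two criteria named in the abstract can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the "image-to-text" and "text-to-image" models are stand-in linear maps, the edits are vector additions, and all function names here are hypothetical. Cyclic consistency compares an input against its round trip through the other modality; forward equivariance asks whether editing then transferring matches transferring then applying the corresponding edit.

```python
# Toy sketch of cyclic consistency and forward equivariance.
# All models/edits below are illustrative stand-ins, not ACON's actual models.
import numpy as np

def cosine(a, b):
    # Cosine similarity as a simple consistency score.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cyclic_consistency(x, to_text, to_image, sim=cosine):
    # sim(x, g(f(x))): transfer to the other modality and back.
    return sim(x, to_image(to_text(x)))

def forward_equivariance(x, edit_img, edit_txt, to_text, sim=cosine):
    # Edit-then-caption vs. caption-then-edit (in the text space).
    return sim(to_text(edit_img(x)), edit_txt(to_text(x)))

# Toy "modalities": both are 2-D vectors; the transfer is an orthogonal map,
# so the cycle is exact and equivariance holds by construction.
W = np.array([[0.0, 1.0], [1.0, 0.0]])   # hypothetical image->text map
to_text = lambda v: W @ v
to_image = lambda v: W.T @ v             # exact inverse of to_text here
x = np.array([1.0, 2.0])
print(round(cyclic_consistency(x, to_text, to_image), 3))  # → 1.0
```

Real models are of course nowhere near this ideal; the point of the paper's metrics is precisely to measure how far actual any-to-any systems fall short of these identities.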
Problem

Research questions and friction points this paper is trying to address.

Evaluating cross-modal consistency in any-to-any generative models
Comparing any-to-any models with specialized models on consistency metrics
Assessing latent space structure through multiple editing operations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces ACON dataset for cross-modal evaluation
Uses three consistency criteria (cyclic consistency, forward equivariance, conjugated equivariance) for rigorous assessment
Analyzes latent space via multiple editing operations
Jiwan Chung
Yonsei University
Computer Vision · NLP · Multimodal Learning
Janghan Yoon
Yonsei University
Junhyeong Park
Yonsei University
Sangeyl Lee
Yonsei University
Joowon Yang
Yonsei University
Sooyeon Park
Yonsei University
Youngjae Yu
Yonsei University