Are Any-to-Any Models More Consistent Across Modality Transfers Than Specialists?

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether any-to-any multimodal generative models exhibit greater cross-modal consistency than specialized models, and whether such consistency is objectively grounded or merely subjectively perceived. To this end, the authors introduce ACON, a benchmark explicitly designed for evaluating cross-modal consistency, covering images, captions, editing instructions, and Q&A pairs. They further propose three principled evaluation criteria: cyclic consistency, forward equivariance, and conjugated equivariance. Experimental results show that any-to-any models do not significantly outperform specialized models on pointwise tasks such as cyclic reconstruction; however, they demonstrate weak but observable cross-modal equivariance within structured latent spaces, particularly in chained editing scenarios. The work delineates the practical boundaries of unified architectures and establishes an equivariance-centered paradigm for assessing multimodal consistency.

📝 Abstract
Any-to-any generative models aim to enable seamless interpretation and generation across multiple modalities within a unified framework, yet their ability to preserve relationships across modalities remains uncertain. Do unified models truly achieve cross-modal coherence, or is this coherence merely perceived? To explore this, we introduce ACON, a dataset of 1,000 images (500 newly contributed) paired with captions, editing instructions, and Q&A pairs to evaluate cross-modal transfers rigorously. Using three consistency criteria (cyclic consistency, forward equivariance, and conjugated equivariance), our experiments reveal that any-to-any models do not consistently demonstrate greater cross-modal consistency than specialized models in pointwise evaluations such as cyclic consistency. However, equivariance evaluations uncover weak but observable consistency through structured analyses of the intermediate latent space enabled by multiple editing operations. We release our code and data at https://github.com/JiwanChung/ACON.
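The two criteria named in the abstract can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the "image-to-text" and "text-to-image" models are stand-in linear maps, the edits are vector additions, and all function names here are hypothetical. Cyclic consistency compares an input against its round trip through the other modality; forward equivariance asks whether editing then transferring matches transferring then applying the corresponding edit.

```python
# Toy sketch of cyclic consistency and forward equivariance.
# All models/edits below are illustrative stand-ins, not ACON's actual models.
import numpy as np

def cosine(a, b):
    # Cosine similarity as a simple consistency score.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cyclic_consistency(x, to_text, to_image, sim=cosine):
    # sim(x, g(f(x))): transfer to the other modality and back.
    return sim(x, to_image(to_text(x)))

def forward_equivariance(x, edit_img, edit_txt, to_text, sim=cosine):
    # Edit-then-caption vs. caption-then-edit (in the text space).
    return sim(to_text(edit_img(x)), edit_txt(to_text(x)))

# Toy "modalities": both are 2-D vectors; the transfer is an orthogonal map,
# so the cycle is exact and equivariance holds by construction.
W = np.array([[0.0, 1.0], [1.0, 0.0]])   # hypothetical image->text map
to_text = lambda v: W @ v
to_image = lambda v: W.T @ v             # exact inverse of to_text here
x = np.array([1.0, 2.0])
print(round(cyclic_consistency(x, to_text, to_image), 3))  # → 1.0
```

Real models are of course nowhere near this ideal; the point of the paper's metrics is precisely to measure how far actual any-to-any systems fall short of these identities.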
Problem

Research questions and friction points this paper is trying to address.

Evaluating cross-modal consistency in any-to-any generative models
Comparing any-to-any models with specialized models on consistency metrics
Assessing latent space structure through multiple editing operations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces ACON dataset for cross-modal evaluation
Uses three consistency criteria (cyclic consistency, forward equivariance, conjugated equivariance) for rigorous assessment
Analyzes latent space via multiple editing operations
Jiwan Chung
Yonsei University
Computer Vision · NLP · Multimodal Learning
Janghan Yoon
Yonsei University
Junhyeong Park
Yonsei University
Sangeyl Lee
Yonsei University
Joowon Yang
Yonsei University
Sooyeon Park
Yonsei University
Youngjae Yu
Yonsei University