🤖 AI Summary
To extend the multimodal capabilities of frozen large language models (LLMs) efficiently, this paper proposes a dual-tower architecture that decouples the visual and linguistic branches: visual information is injected solely through modality-specific weights while all LLM parameters remain frozen, preserving the model's original language competence. Methodologically, it combines cross-modal feature alignment with joint understanding–generation training. The authors report three key empirical findings: (i) image–text understanding data substantially improves generation quality; (ii) reducing noise in image training data enhances overall performance; and (iii) feature alignment accelerates convergence for smaller models while having minimal impact on larger ones. Experiments show the approach consistently outperforming mainstream multimodal baselines on bidirectional image–text generation tasks; notably, small-model convergence speed improves by over 40%, and both generated-image fidelity and text–image alignment are significantly enhanced.
📝 Abstract
We propose X-Fusion, a framework that extends pretrained Large Language Models (LLMs) for multimodal tasks while preserving their language capabilities. X-Fusion employs a dual-tower design with modality-specific weights, keeping the LLM's parameters frozen while integrating vision-specific information for both understanding and generation. Our experiments demonstrate that X-Fusion consistently outperforms alternative architectures on both image-to-text and text-to-image tasks. We find that incorporating understanding-focused data improves generation quality, reducing image data noise enhances overall performance, and feature alignment accelerates convergence for smaller models but has minimal impact on larger ones. Our findings provide valuable insights into building efficient unified multimodal models.
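The dual-tower idea described above can be sketched in a few lines: each layer keeps two sets of weights, a frozen language path and a trainable vision path, and routes each token by modality. This is a minimal toy illustration, not the paper's implementation; the single matrix multiply standing in for a full transformer layer, the boolean modality mask, and the vision weights being initialized from the language weights are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden size

# Frozen language-tower projection (stands in for a full transformer layer).
W_text = rng.normal(size=(d, d))
W_text.flags.writeable = False  # "frozen": any in-place update would raise

# Trainable vision-tower projection. Initializing it as a copy of the
# language weights is an assumption made for this sketch.
W_img = W_text.copy()

def dual_tower_layer(x, is_image):
    """Route each token through its modality-specific weights.

    x: (seq, d) hidden states; is_image: (seq,) boolean modality mask.
    Text tokens use the frozen language path, so the LLM's behavior on
    pure text is untouched; only the vision path would receive gradients.
    """
    out = np.empty_like(x)
    out[~is_image] = x[~is_image] @ W_text  # frozen language path
    out[is_image] = x[is_image] @ W_img     # trainable vision path
    return out

x = rng.normal(size=(5, d))
mask = np.array([False, False, True, True, False])  # tokens 2-3 are image tokens
y = dual_tower_layer(x, mask)

# Before any vision training, both towers are identical, so the layer
# reproduces the original frozen projection for every token.
assert np.allclose(y, x @ W_text)
```

In a real model, `W_img` would be updated on multimodal data while `W_text` stays fixed, which is how the design preserves language capability while adding vision.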