🤖 AI Summary
To extend the multimodal capabilities of frozen large language models (LLMs) efficiently, this paper proposes a dual-tower architecture that decouples the visual and linguistic branches: visual information is injected solely through modality-specific weights while all LLM parameters remain frozen, preserving the model's original language competence. Methodologically, it combines cross-modal feature alignment with joint understanding–generation training. The authors report three key empirical findings: (i) image–text understanding data substantially improves generation quality; (ii) reducing noise in image training data enhances overall performance; and (iii) feature alignment accelerates convergence for smaller models while having minimal impact on larger ones. Experiments show the approach consistently outperforming mainstream multimodal baselines on bidirectional image–text generation tasks; notably, small-model convergence speed improves by over 40%, and both generated-image fidelity and text–image alignment are significantly enhanced.
📝 Abstract
We propose X-Fusion, a framework that extends pretrained Large Language Models (LLMs) for multimodal tasks while preserving their language capabilities. X-Fusion employs a dual-tower design with modality-specific weights, keeping the LLM's parameters frozen while integrating vision-specific information for both understanding and generation. Our experiments demonstrate that X-Fusion consistently outperforms alternative architectures on both image-to-text and text-to-image tasks. We find that incorporating understanding-focused data improves generation quality, reducing image data noise enhances overall performance, and feature alignment accelerates convergence for smaller models but has minimal impact on larger ones. Our findings provide valuable insights into building efficient unified multimodal models.
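The dual-tower idea described above can be sketched in a few lines: each layer keeps two sets of weights, a frozen language path and a trainable vision path, and routes each token by modality. This is a minimal toy illustration, not the paper's implementation; the single matrix multiply standing in for a full transformer layer, the boolean modality mask, and the vision weights being initialized from the language weights are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden size

# Frozen language-tower projection (stands in for a full transformer layer).
W_text = rng.normal(size=(d, d))
W_text.flags.writeable = False  # "frozen": any in-place update would raise

# Trainable vision-tower projection. Initializing it as a copy of the
# language weights is an assumption made for this sketch.
W_img = W_text.copy()

def dual_tower_layer(x, is_image):
    """Route each token through its modality-specific weights.

    x: (seq, d) hidden states; is_image: (seq,) boolean modality mask.
    Text tokens use the frozen language path, so the LLM's behavior on
    pure text is untouched; only the vision path would receive gradients.
    """
    out = np.empty_like(x)
    out[~is_image] = x[~is_image] @ W_text  # frozen language path
    out[is_image] = x[is_image] @ W_img     # trainable vision path
    return out

x = rng.normal(size=(5, d))
mask = np.array([False, False, True, True, False])  # tokens 2-3 are image tokens
y = dual_tower_layer(x, mask)

# Before any vision training, both towers are identical, so the layer
# reproduces the original frozen projection for every token.
assert np.allclose(y, x @ W_text)
```

In a real model, `W_img` would be updated on multimodal data while `W_text` stays fixed, which is how the design preserves language capability while adding vision.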