HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding

📅 2025-03-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Problem: Existing large multimodal models (LMMs) typically adopt separate visual and linguistic encoders, leading to parameter redundancy, high training costs, and suboptimal performance from late-stage modality fusion.
Method: We propose the first lightweight, native single-Transformer multimodal model that enables early visual–linguistic embedding fusion and autoregressive visual instruction following. Our approach employs a unified single-Transformer architecture and introduces a knowledge-inheritance training paradigm—initializing from pretrained LLM/VLM weights and applying distillation-based fine-tuning—to enhance convergence efficiency and representation capability.
Contribution/Results: Experiments demonstrate that our model outperforms comparable single-Transformer baselines on multimodal understanding tasks, achieving performance on par with modular LMMs while reducing training resource consumption by over 40%. This work establishes a new paradigm for efficient, end-to-end multimodal modeling.

📝 Abstract
Recent advancements in large language models (LLMs) have significantly propelled the development of large multi-modal models (LMMs), highlighting the potential for general and intelligent assistants. However, most LMMs model visual and textual modalities separately, leading to recent efforts to develop native LMMs using a single transformer. Despite the promise, these native models are resource-intensive and often exhibit performance gaps compared to their compositional counterparts. To alleviate this issue, we propose a simple yet efficient method to construct a baseline for a native, end-to-end large multi-modal model in a single transformer. First, we propose a new early-fusion LMM that fuses multi-modal inputs at an early stage and responds to visual instructions in an auto-regressive manner. Second, we devise an efficient training recipe for the proposed model, which harnesses the prior knowledge of pre-trained models, addressing both the performance limitations and the challenge of resource consumption. The proposed model demonstrates superior performance compared to other LMMs using one transformer and significantly narrows the performance gap with compositional LMMs.
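The early-fusion idea described in the abstract can be illustrated schematically: image patches and text tokens are embedded into a shared space and concatenated into one sequence *before* the first transformer layer, so a single stack of attention blocks processes both modalities jointly. The sketch below is a minimal NumPy toy, not the authors' implementation; all function names, dimensions, and the single-head attention are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # illustrative embedding width, not the paper's

def patch_embed(image, patch=8):
    """Split an image into non-overlapping patches, project each to d_model."""
    H, W, C = image.shape
    p = image.reshape(H // patch, patch, W // patch, patch, C)
    p = p.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    W_proj = rng.standard_normal((p.shape[1], d_model)) * 0.02
    return p @ W_proj  # (num_patches, d_model)

def token_embed(token_ids, vocab=1000):
    """Look up text-token embeddings from a random toy table."""
    table = rng.standard_normal((vocab, d_model)) * 0.02
    return table[token_ids]  # (num_tokens, d_model)

def self_attention(x):
    """One single-head attention pass over the fused sequence."""
    scores = x @ x.T / np.sqrt(d_model)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

image = rng.random((32, 32, 3))          # toy 32x32 RGB image -> 16 patches
text_ids = np.array([5, 42, 7])          # toy instruction tokens

# Early fusion: both modalities enter ONE sequence before the first layer,
# in contrast to compositional LMMs that fuse after a separate vision encoder.
fused = np.concatenate([patch_embed(image), token_embed(text_ids)], axis=0)
out = self_attention(fused)              # one block attends across modalities
```

In an actual autoregressive model the text positions would additionally use a causal mask; this sketch only shows where the fusion happens.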
Problem

Research questions and friction points this paper is trying to address.

Develops a single-transformer model for multi-modal understanding.
Addresses the high resource costs and performance gaps of native LMMs.
Proposes early-fusion and efficient training for visual-textual integration.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Early-fusion LMM merges visual and textual embeddings at the input stage.
Efficient training recipe leverages prior knowledge from pre-trained models.
Single-transformer design significantly reduces training resource consumption.
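The "knowledge-inheritance" recipe pairs weight initialization from pre-trained LLM/VLM checkpoints with distillation-based fine-tuning. A standard way to realize the distillation part is a temperature-scaled KL divergence between teacher and student output distributions; the NumPy sketch below shows that generic KD loss under stated assumptions, not the paper's exact objective (its loss terms and temperature are not specified here).

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax along the last axis."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) with temperature T, scaled by T^2
    (the conventional form of knowledge-distillation loss)."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean() * T * T)

rng = np.random.default_rng(1)
teacher = rng.standard_normal((4, 10))   # toy teacher logits: 4 positions, 10 classes

loss_self = distill_loss(teacher, teacher)                     # identical logits
loss_other = distill_loss(rng.standard_normal((4, 10)), teacher)
```

When the student matches the teacher exactly the loss is zero, which is why initializing the single transformer from pretrained weights gives the distillation objective a strong starting point.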
🔎 Similar Papers
No similar papers found.