SAISA: Towards Multimodal Large Language Models with Both Training and Inference Efficiency

📅 2025-02-04
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Multimodal large language models (MLLMs) face a fundamental trade-off between training and inference efficiency: embedding-space alignment (e.g., LLaVA-1.5) incurs high inference overhead, while cross-attention alignment (e.g., Flamingo) suffers from prohibitive training costs. To address this, the authors propose SAISA, a self-attention input-space alignment architecture. Key contributions: (1) NAAViT, a self-attention mechanism that eliminates attention among visual tokens, with a pilot experiment on LLaVA-1.5 showing that such attention is highly redundant; (2) SAISA, which aligns visual features directly with the input space of NAAViT self-attention blocks, reducing computation in both self-attention blocks and feed-forward networks; and (3) experiments showing that, under the same configuration as LLaVA-1.5, SAISA reduces inference FLOPs by 66% and training budget by 26% while achieving superior accuracy, with ablations confirming effectiveness across various LLMs and vision encoders.

πŸ“ Abstract
Multimodal Large Language Models (MLLMs) mainly fall into two architectures, each involving a trade-off between training and inference efficiency: embedding space alignment (e.g., LLaVA-1.5) is inefficient during inference, while cross-attention space alignment (e.g., Flamingo) is inefficient in training. In this paper, we compare these two architectures and identify the key factors for building efficient MLLMs. A primary difference between them lies in how attention is applied to visual tokens, particularly in their interactions with each other. To investigate whether attention among visual tokens is necessary, we propose a new self-attention mechanism, NAAViT (No Attention Among Visual Tokens), which eliminates this type of attention. Our pilot experiment on LLaVA-1.5 shows that attention among visual tokens is highly redundant. Based on these insights, we introduce SAISA (Self-Attention Input Space Alignment), a novel architecture that enhances both training and inference efficiency. SAISA directly aligns visual features with the input spaces of NAAViT self-attention blocks, reducing computational overhead in both self-attention blocks and feed-forward networks (FFNs). Using the same configuration as LLaVA-1.5, SAISA reduces inference FLOPs by 66% and training budget by 26%, while achieving superior performance in terms of accuracy. Comprehensive ablation studies further validate the effectiveness of SAISA across various LLMs and visual encoders. The code and model will be publicly available at https://github.com/icip-cas/SAISA.
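The core idea behind NAAViT, as described in the abstract, is a self-attention mask that removes attention among visual tokens while leaving text attention untouched. The paper's actual implementation is not shown here; the following is a minimal sketch of such a mask in PyTorch, under the assumptions that visual tokens precede text tokens in the sequence, that each visual token may still attend to itself, and that text tokens retain standard causal attention (the function name `naavit_attention_mask` is ours, not the authors').

```python
import torch

def naavit_attention_mask(num_visual: int, num_text: int) -> torch.Tensor:
    """Sketch of a NAAViT-style attention mask (True = attention allowed).

    Assumptions (not confirmed by the paper): visual tokens come first,
    a visual token may attend to itself, and text tokens use ordinary
    causal attention over the whole sequence.
    """
    n = num_visual + num_text
    # Start from a standard causal (lower-triangular) mask.
    mask = torch.tril(torch.ones(n, n, dtype=torch.bool))
    # Identify positions holding visual tokens.
    visual = torch.zeros(n, dtype=torch.bool)
    visual[:num_visual] = True
    # Pairs where both query and key are visual tokens.
    among_visual = visual[:, None] & visual[None, :]
    # Disable attention among *distinct* visual tokens (keep the diagonal).
    mask &= ~(among_visual & ~torch.eye(n, dtype=torch.bool))
    return mask

mask = naavit_attention_mask(num_visual=3, num_text=2)
# Visual rows attend only to themselves; text rows attend causally to all.
```

A mask like this can be passed as `attn_mask` to `torch.nn.functional.scaled_dot_product_attention`; because visual queries no longer interact with other visual keys, the quadratic cost in the number of visual tokens drops out, which is the kind of saving the reported 66% inference-FLOPs reduction relies on.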
Problem

Research questions and friction points this paper is trying to address.

Improves multimodal model efficiency
Reduces computational overhead
Enhances training and inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

NAAViT eliminates visual token attention
SAISA aligns visual features efficiently
SAISA reduces FLOPs and training budget
Qianhao Yuan
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Yanjiang Liu
UCAS
Yaojie Lu
Institute of Software, Chinese Academy of Sciences
Information Extraction, Large Language Models
Hongyu Lin
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences
Ben He
Professor, University of Chinese Academy of Sciences
Natural Language Processing, Information Retrieval
Xianpei Han
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences
Le Sun
Institute of Software, CAS
Information Retrieval, Natural Language Processing