LLaVA-FA: Learning Fourier Approximation for Compressing Large Multimodal Models

📅 2026-01-28
📈 Citations: 2
Influential: 1
🤖 AI Summary
This work addresses the challenge of deploying large multimodal models, which suffer from high computational and memory costs; existing compression techniques often incur significant reconstruction error because they apply low-rank decomposition and quantization separately. To overcome this limitation, the authors propose a frequency-domain joint compression framework that performs low-rank approximation and quantization simultaneously in the Fourier domain, leveraging its decorrelation and conjugate symmetry properties to yield more compact and accurate weight representations. Key innovations include PolarQuant, a polar-coordinate quantization method tailored for complex-valued matrices, and an optional diagonal calibration (ODC) scheme that operates without extensive calibration data. Experiments demonstrate that the proposed approach substantially reduces activation memory and computational overhead across multiple multimodal benchmarks, while matching or surpassing the performance of current efficient models.
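The frequency-domain joint compression idea described above can be sketched in a few lines: transform the weights with a real FFT (conjugate symmetry means only half the spectrum needs to be stored), then low-rank-approximate the resulting complex spectrum before inverting. The function name, the rank choice, and the use of a plain truncated SVD are illustrative assumptions here, not the paper's exact algorithm.

```python
import numpy as np

def fourier_low_rank(W, rank):
    """Low-rank approximation of W computed on its half-spectrum.

    Illustrative sketch only: rfft keeps half the spectrum thanks to
    conjugate symmetry, and a truncated SVD of the complex coefficient
    matrix stands in for the paper's joint approximation.
    """
    F = np.fft.rfft(W, axis=-1)                   # half-spectrum (complex)
    U, s, Vh = np.linalg.svd(F, full_matrices=False)
    F_r = (U[:, :rank] * s[:rank]) @ Vh[:rank]    # rank-r complex approximation
    return np.fft.irfft(F_r, n=W.shape[-1], axis=-1)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
err = lambda A: np.linalg.norm(W - A) / np.linalg.norm(W)
# Relative error shrinks as the retained rank grows, and vanishes at full rank.
```

Compressing in the spectral domain rather than on the raw weights is what lets a single factorization serve both the low-rank and (after quantizing the factors) the quantization step jointly.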

📝 Abstract
Large multimodal models (LMMs) have achieved impressive performance on various vision-language tasks, but their substantial computational and memory costs hinder practical deployment. Existing compression methods often decouple low-rank decomposition and quantization, leading to compounded reconstruction errors, especially in multimodal architectures with cross-modal redundancy. To address this issue, we propose LLaVA-FA, a novel efficient LMM that performs joint low-rank plus quantization approximation in the frequency domain. By leveraging the de-correlation and conjugate symmetry properties of the Fourier transform, LLaVA-FA achieves more compact and accurate weight representations. Furthermore, we introduce PolarQuant, a polar-coordinate quantization method tailored for complex matrices, and an optional diagonal calibration (ODC) scheme that eliminates the need for large-scale calibration data. Extensive experimental results demonstrate that our proposed LLaVA-FA outperforms existing efficient multimodal models across multiple benchmarks while maintaining minimal activated parameters and low computational costs, validating its effectiveness as a powerful solution for compressing LMMs.
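As a rough illustration of the PolarQuant idea from the abstract, the sketch below quantizes a weight matrix's Fourier coefficients in polar coordinates (magnitude and phase) rather than as real/imaginary pairs. The uniform quantizer, the 4-bit widths, and all function names are assumptions for demonstration only, not the paper's implementation.

```python
import numpy as np

def uniform_quantize(x, bits):
    """Uniformly quantize x to 2**bits levels over its observed range (assumed scheme)."""
    lo, hi = x.min(), x.max()
    if hi == lo:
        return x.copy()
    levels = 2 ** bits - 1
    q = np.round((x - lo) / (hi - lo) * levels)
    return q / levels * (hi - lo) + lo

def polar_quantize_weights(W, mag_bits=4, phase_bits=4):
    """Quantize W's spectrum in polar form and return the reconstructed weights."""
    F = np.fft.rfft(W, axis=-1)           # conjugate symmetry: store half the spectrum
    mag = uniform_quantize(np.abs(F), mag_bits)
    phase = uniform_quantize(np.angle(F), phase_bits)
    F_hat = mag * np.exp(1j * phase)      # rebuild complex coefficients from polar parts
    return np.fft.irfft(F_hat, n=W.shape[-1], axis=-1)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
W_hat = polar_quantize_weights(W)
rel_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
```

The polar representation is natural for complex matrices because magnitude and phase have very different statistics (non-negative and heavy-tailed versus bounded in [-π, π]), so quantizing them separately tends to lose less than quantizing real and imaginary parts with one shared grid.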
Problem

Research questions and friction points this paper is trying to address.

large multimodal models
model compression
low-rank decomposition
quantization
cross-modal redundancy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fourier approximation
joint low-rank quantization
PolarQuant
frequency-domain compression
multimodal model efficiency
Pengcheng Zheng
University of Electronic Science and Technology of China
Chaoning Zhang
Professor at UESTC (University of Electronic Science and Technology of China)
Computer Vision, LLM and VLM, GenAI and AIGC Detection
Jiarong Mo
University of Electronic Science and Technology of China
GuoHui Li
University of Electronic Science and Technology of China
Jiaquan Zhang
University of Electronic Science and Technology of China
Jiahao Zhang
Chengdu University of Information Technology
Sihan Cao
University of Electronic Science and Technology of China
Sheng Zheng
Beijing Institute of Technology
Computer Vision
Caiyan Qin
Harbin Institute of Technology, Shenzhen
Guoqing Wang
University of Electronic Science and Technology of China
Computer Vision, Machine Learning, Pattern Recognition, Intelligent Systems
Yang Yang
University of Electronic Science and Technology of China