🤖 AI Summary
To address the excessive memory and computational overhead of deploying large language models (LLMs) and vision-language models (VLMs), this paper proposes an activation-aware low-rank compression framework. Methodologically, it establishes, for the first time, a quantitative theoretical relationship between layer-wise activation errors and the change in model loss; introduces a single-tolerance rank selection criterion, with a rigorous proof of its Pareto optimality; and designs a zero-shot Pareto-Guided Singular Value Decomposition (PGSVD) pipeline that integrates activation-error analysis, bi-objective optimization modeling, and an alternating least-squares solver. Experimental results demonstrate that, at comparable compression ratios, the method improves accuracy and reduces inference latency across multiple large-scale LLMs and VLMs, validating its effectiveness and state-of-the-art performance.
📝 Abstract
Large language models (LLMs) and vision-language models (VLMs) have achieved state-of-the-art performance, but they pose significant memory and compute challenges in deployment. We present a novel low-rank compression framework to address this challenge. First, we upper-bound the change in network loss via layer-wise activation-based compression errors, filling a theoretical gap in the literature. We then formulate low-rank model compression as a bi-objective optimization problem and prove that a single uniform tolerance yields surrogate Pareto-optimal heterogeneous ranks. Based on these theoretical insights, we propose Pareto-Guided Singular Value Decomposition (PGSVD), a zero-shot pipeline that improves activation-aware compression via Pareto-guided rank selection and an alternating least-squares implementation. We apply PGSVD to both LLMs and VLMs, demonstrating better accuracy at the same compression levels along with inference speedup.
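To make the core idea concrete, here is a minimal sketch (not the authors' code) of activation-aware low-rank truncation with a single uniform error tolerance. The whitening-by-activation-covariance step and the tolerance-based rank rule are assumptions chosen to illustrate how one tolerance can induce heterogeneous per-layer ranks; the paper's actual PGSVD pipeline additionally uses alternating least squares and its own bound-driven formulation.

```python
import numpy as np

def activation_aware_truncate(W, X, tol):
    """Illustrative sketch: factor W ~= A @ B so that the relative
    activation error ||W X - A B X||_F / ||W X||_F stays within `tol`,
    using the smallest rank that satisfies the tolerance.

    W   : (d_out, d_in) layer weight matrix
    X   : (d_in, n) calibration activations
    tol : uniform relative error tolerance (same value for every layer)
    """
    # Whiten by the activation covariance so that plain SVD truncation
    # error equals the activation-space error (assumed reading of
    # "activation-aware"; the small ridge keeps the Cholesky stable).
    C = np.linalg.cholesky(X @ X.T + 1e-8 * np.eye(X.shape[0]))
    U, s, Vt = np.linalg.svd(W @ C, full_matrices=False)

    # Relative Frobenius error of every candidate rank r = 1..len(s):
    # the tail singular-value energy left out by a rank-r truncation.
    total = np.sum(s**2)
    rel_err = np.sqrt(np.maximum(total - np.cumsum(s**2), 0.0) / total)
    ok = np.nonzero(rel_err <= tol)[0]
    r = int(ok[0]) + 1 if ok.size else len(s)  # smallest admissible rank

    A = U[:, :r] * s[:r]                   # (d_out, r)
    B = np.linalg.solve(C.T, Vt[:r].T).T   # undo whitening: Vt_r @ C^{-1}
    return A, B, r
```

Applying the same `tol` to every layer lets each layer settle on its own rank `r`, which is the mechanism behind the heterogeneous ranks that the paper proves surrogate Pareto-optimal.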