🤖 AI Summary
To address the excessive memory and computational overhead of deploying large language models (LLMs) and vision-language models (VLMs), this paper proposes an activation-aware low-rank compression framework. Methodologically, it establishes, for the first time, a quantitative theoretical relationship between layer-wise activation errors and the change in model loss; introduces a single-tolerance rank selection criterion, with a rigorous proof of its Pareto optimality; and designs a zero-shot Pareto-Guided Singular Value Decomposition (PGSVD) pipeline that integrates activation-error analysis, bi-objective optimization modeling, and an alternating least-squares solver. Experimental results demonstrate that, at comparable compression ratios, the method improves accuracy and reduces inference latency across multiple large-scale LLMs and VLMs, validating its effectiveness and state-of-the-art performance.
📝 Abstract
Large language models (LLMs) and vision-language models (VLMs) have achieved state-of-the-art performance, but they pose significant memory and compute challenges in deployment. We present a novel low-rank compression framework to address this challenge. First, we upper-bound the change in network loss via layer-wise activation-based compression errors, filling a theoretical gap in the literature. We then formulate low-rank model compression as a bi-objective optimization problem and prove that a single uniform tolerance yields surrogate Pareto-optimal heterogeneous ranks. Based on these theoretical insights, we propose Pareto-Guided Singular Value Decomposition (PGSVD), a zero-shot pipeline that improves activation-aware compression via Pareto-guided rank selection and an alternating least-squares implementation. We apply PGSVD to both LLMs and VLMs, demonstrating better accuracy at the same compression levels along with inference speedup.
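To make the core idea concrete, here is a minimal sketch (not the authors' code) of activation-aware low-rank truncation with a single uniform error tolerance. The whitening-by-activation-covariance step and the tolerance-based rank rule are assumptions chosen to illustrate how one tolerance can induce heterogeneous per-layer ranks; the paper's actual PGSVD pipeline additionally uses alternating least squares and its own bound-driven formulation.

```python
import numpy as np

def activation_aware_truncate(W, X, tol):
    """Illustrative sketch: factor W ~= A @ B so that the relative
    activation error ||W X - A B X||_F / ||W X||_F stays within `tol`,
    using the smallest rank that satisfies the tolerance.

    W   : (d_out, d_in) layer weight matrix
    X   : (d_in, n) calibration activations
    tol : uniform relative error tolerance (same value for every layer)
    """
    # Whiten by the activation covariance so that plain SVD truncation
    # error equals the activation-space error (assumed reading of
    # "activation-aware"; the small ridge keeps the Cholesky stable).
    C = np.linalg.cholesky(X @ X.T + 1e-8 * np.eye(X.shape[0]))
    U, s, Vt = np.linalg.svd(W @ C, full_matrices=False)

    # Relative Frobenius error of every candidate rank r = 1..len(s):
    # the tail singular-value energy left out by a rank-r truncation.
    total = np.sum(s**2)
    rel_err = np.sqrt(np.maximum(total - np.cumsum(s**2), 0.0) / total)
    ok = np.nonzero(rel_err <= tol)[0]
    r = int(ok[0]) + 1 if ok.size else len(s)  # smallest admissible rank

    A = U[:, :r] * s[:r]                   # (d_out, r)
    B = np.linalg.solve(C.T, Vt[:r].T).T   # undo whitening: Vt_r @ C^{-1}
    return A, B, r
```

Applying the same `tol` to every layer lets each layer settle on its own rank `r`, which is the mechanism behind the heterogeneous ranks that the paper proves surrogate Pareto-optimal.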