🤖 AI Summary
To address the high memory footprint and low computational efficiency of RNN-based large language models (e.g., RWKV) when deployed on resource-constrained edge devices such as mobile robots and smartphones, this paper proposes the first end-to-end model compression framework tailored for the RWKV architecture. The framework integrates lightweight architectural redesign, structured pruning, post-training quantization, and knowledge distillation. It achieves 3.4×–5× memory reduction with negligible accuracy degradation. Moreover, compared to Transformer-based models of equivalent accuracy, the compressed RWKV models require approximately 4× less memory. This work constitutes the first systematic solution to the lightweight deployment challenge for RWKV-style models, delivering a reproducible, highly compatible technical pathway for efficient large-model inference on edge devices.
📝 Abstract
To deploy LLMs on resource-contained platforms such as mobile robots and smartphones, non-transformers LLMs have achieved major breakthroughs. Recently, a novel RNN-based LLM family, Repentance Weighted Key Value (RWKV) has shown strong computational efficiency; nevertheless, RWKV models still have high parameter counts which limited their deployment. In this paper, we propose a suite of compression techniques, ranging from model architecture optimizations to post-training compression, tailored to the RWKV architecture. Combined, our techniques reduce the memory footprint of RWKV models by 3.4x -- 5x with only negligible degradation in accuracy; compared to transformer LLMs with similar accuracy, our models require 4x less memory footprint.