🤖 AI Summary
Large language model (LLM) inference faces tight memory and computational constraints, yet existing compression methods rely on static heuristics and cannot adapt to heterogeneous request patterns or dynamic device states. To address this, the paper proposes the first runtime-adaptive collaborative pruning framework, which jointly and dynamically compresses both model weights and KV caches online. The method employs Proximal Policy Optimization (PPO)-based reinforcement learning for real-time pruning decisions, integrates KV-cache-aware analysis with differentiated pruning strategies for FFN and attention layers, and introduces a real-time memory-parameter ratio tracking mechanism. Evaluated across multiple LLMs and datasets, the framework reduces inference latency by up to 37% and peak memory usage by up to 41% compared with state-of-the-art methods, while keeping accuracy degradation below 1%.
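The memory-parameter ratio the summary mentions can be made concrete with a back-of-the-envelope calculation. The sketch below is not from the paper; the function names, the 7B-parameter configuration, and the fp16 (2-byte) assumption are all illustrative. It shows why the ratio the framework tracks shifts at runtime: parameter memory is fixed, while KV-cache memory grows with batch size and sequence length.

```python
# Hypothetical sketch of the runtime memory-parameter ratio tracking idea;
# all names, shapes, and constants here are assumptions, not the paper's code.

def kv_cache_bytes(batch, seq_len, layers, heads, head_dim, dtype_bytes=2):
    # K and V tensors per layer: 2 * batch * seq_len * heads * head_dim values
    return 2 * batch * seq_len * layers * heads * head_dim * dtype_bytes

def param_bytes(n_params, dtype_bytes=2):
    # Parameter memory is constant regardless of the workload
    return n_params * dtype_bytes

def memory_param_ratio(batch, seq_len, layers, heads, head_dim, n_params):
    return kv_cache_bytes(batch, seq_len, layers, heads, head_dim) / param_bytes(n_params)

# Toy 7B-parameter configuration (32 layers, 32 heads, head_dim 128, fp16):
short = memory_param_ratio(8, 512, 32, 32, 128, 7_000_000_000)
long = memory_param_ratio(8, 8192, 32, 32, 128, 7_000_000_000)

# Longer sequences inflate the KV-cache share of memory, so an adaptive
# policy would shift pruning pressure from FFN weights toward the KV-cache.
assert long > short
```

Under these toy numbers the KV-cache grows from roughly 15% of parameter memory at 512 tokens to well over twice the parameter footprint at 8K tokens, which is the kind of runtime variation a static heuristic cannot follow.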
📝 Abstract
Large language models (LLMs) excel at language understanding and generation, but their enormous computational and memory requirements hinder deployment. Compression offers a potential remedy; however, most existing methods rely on fixed heuristics and thus fail to adapt to runtime memory variations or to the heterogeneous KV-cache demands arising from diverse user requests. To address these limitations, we propose RAP, an elastic pruning framework driven by reinforcement learning (RL) that adjusts compression strategies in a runtime-aware manner. Specifically, RAP continuously tracks the evolving ratio between model parameters and the KV-cache during execution. Recognizing that FFNs house most of the parameters, whereas the parameter-light attention layers dominate KV-cache growth, the RL agent retains only those components that maximize utility within the current memory budget, conditioned on the instantaneous workload and device state. Extensive experiments demonstrate that RAP outperforms state-of-the-art baselines, making it the first framework to jointly consider model weights and the KV-cache on the fly.
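The abstract's "retain only those components that maximize utility within the current memory budget" can be illustrated with a toy stand-in. The actual decision-maker in the paper is a learned PPO policy; the greedy utility-per-byte selection below, along with all component names and numbers, is purely a hypothetical simplification to make the budget-constrained trade-off concrete.

```python
# Toy stand-in for budget-constrained component retention. The paper uses a
# learned RL (PPO) policy; this greedy utility-per-byte heuristic and all
# component names/values are illustrative assumptions only.

def select_components(components, budget_bytes):
    """components: list of (name, utility, size_bytes).
    Greedily keep the highest utility-per-byte components that fit."""
    kept, used = [], 0
    for name, utility, size in sorted(
        components, key=lambda c: c[1] / c[2], reverse=True
    ):
        if used + size <= budget_bytes:
            kept.append(name)
            used += size
    return kept

# Attention layers are parameter-light but high-value for the KV-cache;
# FFN layers are parameter-heavy. Sizes/utilities are made-up toy values.
components = [
    ("ffn_layer_30", 0.20, 400),
    ("attn_layer_30", 0.90, 100),
    ("ffn_layer_5", 0.80, 400),
    ("attn_layer_5", 0.95, 100),
]

# Under a tight budget, the cheap attention layers survive and only the
# most useful FFN layer is kept; the rest are pruned.
print(select_components(components, 600))
```

A tighter budget (or a longer-sequence workload that shrinks the weight budget) changes which components survive, which is the adaptivity a fixed pruning ratio cannot express.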