RAMP: Reinforcement Adaptive Mixed-Precision Quantization for Efficient On-Device LLM Inference

📅 2026-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing post-training quantization methods, which typically employ uniform bit-widths and struggle to balance accuracy and efficiency, particularly when deploying large language models on resource-constrained devices. The authors propose RAMP, a framework that uses off-policy Soft Actor-Critic reinforcement learning to allocate mixed precision across layers under a global bit budget, conditioning on per-layer activation statistics, weight characteristics, and structural information. RAMP introduces Scale Folding preprocessing and a quality-prioritized reward mechanism, which together improve stability and convergence speed in sub-4-bit quantization. Notably, the learned quantization strategies transfer zero-shot across model families and scales: on Llama 2 7B, RAMP attains a perplexity of 5.54 at 3.65 effective bits (3.68 GB), outperforming uniform 4-bit AWQ/GPTQ, and the same policy generalizes without retraining to Llama 2 13B and Mistral 7B while preserving 99.5% of FP16 commonsense reasoning performance on CPU, GPU, and edge devices.
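As a rough illustration of the Scale Folding idea described above (migrating activation outliers into the weights via per-channel scaling, with compensation in the preceding normalization layer), here is a minimal pure-Python sketch. The function name, the `alpha` exponent, and the simple power-law scale are assumptions for illustration, not the paper's exact formulation:

```python
def scale_folding(weights, act_scales, norm_gamma, alpha=0.5):
    """Fold per-input-channel activation scales into a linear layer's weights.

    weights    : list of rows, each of length n_in (out x in matrix)
    act_scales : per-channel activation magnitudes (length n_in)
    norm_gamma : gains of the preceding normalization layer (length n_in)
    alpha      : migration strength (0 = no folding, 1 = full migration)

    The transform is output-preserving: dividing activations by s while
    multiplying the matching weight column by s leaves x @ W.T unchanged,
    but it shrinks activation outliers so they quantize more gracefully.
    """
    s = [a ** alpha for a in act_scales]                   # per-channel factors
    folded = [[w * sj for w, sj in zip(row, s)] for row in weights]
    gamma_comp = [g / sj for g, sj in zip(norm_gamma, s)]  # absorb 1/s upstream
    return folded, gamma_comp
```

Because the normalization gain absorbs the 1/s factor, the folded model computes the same function in full precision; only the quantization error profile changes.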

📝 Abstract
Post-training quantization is essential for deploying large language models (LLMs) on resource-constrained hardware, yet state-of-the-art methods enforce uniform bit-widths across layers, yielding suboptimal accuracy-efficiency trade-offs. We present RAMP (Reinforcement Adaptive Mixed Precision), an off-policy Soft Actor-Critic framework that learns per-layer bit-width assignments to minimize perplexity under a global bit budget. The policy conditions on an 11-dimensional embedding of activation statistics, weight properties, and structural descriptors, enabling zero-shot transfer across model families and scales. To enable stable sub-4-bit quantization, we introduce Scale Folding, a preconditioning technique that migrates activation outliers into weights via per-channel scaling and normalization-layer compensation. A quality-prioritized reward with asymmetric penalties and budget cliffs drives rapid convergence. On Llama 2 7B, RAMP achieves 5.54 perplexity at 3.68 GB (3.65 effective bits), outperforming uniform 4-bit AWQ (5.60 at 3.90 GB) and GPTQ by 6% in size and 1% to 3% in quality. Critically, a policy trained only on Llama 2 7B generalizes zero-shot to Llama 2 13B and Mistral 7B, often surpassing target-specific training, supporting the hypothesis that quantization sensitivity is primarily architectural. The HALO pipeline exports allocations to GGUF format for kernel-free inference on CPUs, GPUs, and edge devices, retaining 99.5% of FP16 commonsense reasoning performance.
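The reward structure described in the abstract (quality-prioritized, with asymmetric penalties and a hard budget cliff) can be sketched as follows. The coefficients and the exact functional form are assumptions for illustration; the paper's actual reward may differ:

```python
def ramp_reward(ppl, ppl_fp16, avg_bits, bit_budget,
                quality_weight=10.0, cliff_penalty=100.0):
    """Quality-prioritized reward with asymmetric penalties and a budget cliff.

    ppl        : perplexity of the quantized model (lower is better)
    ppl_fp16   : full-precision baseline perplexity
    avg_bits   : mean effective bit-width of the current allocation
    bit_budget : global effective-bit budget (e.g. 3.65)
    """
    if avg_bits > bit_budget:
        # Budget cliff: overshooting the budget is never worth it,
        # and the penalty grows with the size of the violation.
        return -cliff_penalty - (avg_bits - bit_budget)
    degradation = max(0.0, ppl - ppl_fp16)
    # Asymmetric shaping: perplexity degradation is penalized far more
    # strongly than leftover bit budget is rewarded.
    return -quality_weight * degradation + (bit_budget - avg_bits)
```

The cliff makes the budget a near-hard constraint for the agent, while the large quality weight keeps it from trading accuracy for marginal size savings within the feasible region.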
Problem

Research questions and friction points this paper is trying to address.

post-training quantization
mixed precision
large language models
bit-width allocation
on-device inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning
Mixed-Precision Quantization
Zero-Shot Transfer
Scale Folding
On-Device LLM Inference