AI Summary
Deploying large language models on resource-constrained devices is often hindered by bottlenecks in memory, accuracy, and throughput. This work proposes a CPU-GPU heterogeneous execution architecture that places a compressed backbone model on the GPU while offloading a LoRA-style error-compensation branch to run asynchronously on the CPU. The approach incorporates a sensitivity-aware dynamic rank allocation strategy and an asynchronous compensation pipeline, effectively recovering model accuracy while substantially reducing computational overhead. Experimental results demonstrate that, compared to the purely compressed model, the method improves downstream task accuracy by up to 5.2%, and achieves up to a 10.4× speedup in inference latency relative to the full-precision model.
Abstract
LLMs are difficult to deploy on memory-constrained consumer-grade hardware because of their massive parameter counts. While existing solutions such as model compression and offloading improve deployment feasibility, they often suffer from substantial accuracy degradation or severe throughput bottlenecks. Recent error compensation methods recover accuracy through auxiliary LoRA-style branches, and we observe that these branches are inherently amenable to offloading: they require substantial parameter storage but access only a small subset of compensation parameters during each inference step. Motivated by this opportunity, we propose HCInfer, a heterogeneous inference system that offloads residual compensation to the CPU while executing the compressed backbone on the GPU, and further introduces an asynchronous compensation pipeline and sensitivity-aware dynamic rank allocation to hide compensation overhead and maximize accuracy recovery. Experimental results show that HCInfer improves downstream-task accuracy by up to 5.2% over the compressed model while sustaining a speedup of up to 10.4× over the full-precision model.
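The general idea of a LoRA-style error-compensation branch running alongside a compressed backbone can be illustrated with a small NumPy sketch. This is not HCInfer's implementation: the uniform quantizer, the SVD-based construction of the low-rank factors, the fixed rank, and the helper names (`quantize`, `low_rank_residual`, `compensated_forward`) are all assumptions made for illustration, and a worker thread merely stands in for the CPU side of a real CPU-GPU pipeline.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def quantize(w, bits=3):
    """Uniform symmetric per-tensor quantization (a stand-in for any compressor)."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / levels
    return np.round(w / scale) * scale

def low_rank_residual(w, w_q, rank):
    """Best rank-`rank` approximation of the compression error W - W_q, via SVD."""
    u, s, vt = np.linalg.svd(w - w_q, full_matrices=False)
    b = u[:, :rank] * s[:rank]  # shape (out_features, rank)
    a = vt[:rank, :]            # shape (rank, in_features)
    return b, a

def compensated_forward(x, w_q, b, a, pool):
    """Compute W_q x + B (A x), overlapping the two branches."""
    # Submit the low-rank compensation branch to a worker thread
    # (hypothetical stand-in for the CPU side).
    future = pool.submit(lambda: b @ (a @ x))
    # Meanwhile, run the compressed backbone matmul
    # (hypothetical stand-in for the GPU side).
    y_backbone = w_q @ x
    # Merge once the compensation result is available.
    return y_backbone + future.result()

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))  # full-precision weight
x = rng.standard_normal(64)        # one input activation vector

w_q = quantize(w, bits=3)
b, a = low_rank_residual(w, w_q, rank=16)

with ThreadPoolExecutor(max_workers=1) as pool:
    y = compensated_forward(x, w_q, b, a, pool)

# Weight-space error before and after adding the low-rank compensation.
err_compressed = np.linalg.norm(w - w_q)
err_compensated = np.linalg.norm(w - (w_q + b @ a))
print(f"compressed only: {err_compressed:.3f}, with compensation: {err_compensated:.3f}")
```

In this toy setting the truncated SVD reduces the Frobenius-norm weight error for any rank of at least one. A sensitivity-aware scheme such as the one the abstract describes would presumably vary the rank per layer rather than fix it, but the abstract does not specify how, so the sketch keeps a single constant rank.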