Balancing Latency and Accuracy of Code Completion via Local-Cloud Model Cascading

📅 2026-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the core trade-off in line-level code completion: large language models (LLMs) are accurate but incur high latency, while small language models (SLMs) are fast but less accurate. The authors propose MCCom, a framework that dynamically orchestrates a local SLM and a cloud-based LLM using user behavior signals. By integrating a lightweight 121M-parameter SLM, a two-stage speculative decoding strategy, and an iterative retrieval mechanism, MCCom achieves up to a 47.9% reduction in latency and a 46.3% decrease in LLM invocations on the RepoEval and StmtEval benchmarks, while simultaneously improving the LLM's exact-match accuracy by 8.9%. This approach effectively balances efficiency and accuracy in real-world code completion scenarios.
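The routing idea in the summary can be sketched in a few lines. This is a minimal illustration, not MCCom's implementation: `slm_complete` and `llm_complete` are hypothetical stand-ins for the local 121M model and the cloud LLM, and the single boolean "user rejected the SLM suggestion" stands in for the paper's richer user-behavior signals.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    text: str
    source: str  # "slm" (local) or "llm" (cloud)


def slm_complete(prefix: str) -> str:
    # Hypothetical stand-in for the local 121M-parameter SLM.
    return "print(x)"


def llm_complete(prefix: str) -> str:
    # Hypothetical stand-in for the cloud-based LLM.
    return 'print(f"x = {x}")'


def cascade_complete(prefix: str, user_rejected_slm: bool) -> Suggestion:
    """Serve the fast local suggestion first; escalate to the cloud LLM
    only when the user's action signals that the SLM suggestion failed."""
    if not user_rejected_slm:
        return Suggestion(slm_complete(prefix), "slm")
    return Suggestion(llm_complete(prefix), "llm")
```

Because the cloud model is invoked only on the failure signal, most completions are served locally, which is the source of the reported reduction in LLM invocations.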

📝 Abstract
Line-level code completion requires a critical balance between high accuracy and low latency. Existing methods suffer from a trade-off: large language models (LLMs) provide high-quality suggestions but incur high latency, while small language models (SLMs) are fast but often suboptimal. We propose MCCom (Model-Cascading-based code Completion), a framework that cascades a local SLM with a cloud-based LLM. To achieve effective cascading, MCCom leverages user actions as a novel signal to trigger the LLM only when the SLM fails, significantly reducing cloud computation costs. Furthermore, we introduce a two-stage speculative decoding strategy and an iterative retrieval mechanism to enhance collaboration between the models. We also train a 121M-parameter lightweight model, which achieves 73.8% of the performance of a 7B state-of-the-art model. Evaluated on RepoEval and a new real-world benchmark StmtEval, MCCom reduces inference latency by up to 47.9% and LLM usage by 46.3%, while improving the LLM's exact match rate by 8.9% through effective collaboration.
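The two-stage speculative decoding the abstract mentions relies on a standard draft-and-verify step: the SLM drafts tokens cheaply, and the larger model accepts the longest agreeing prefix. The helper below is a generic sketch of that acceptance rule under a greedy-verification assumption; it is not MCCom's actual algorithm, and the token lists abstract away real model calls.

```python
def speculative_accept(draft_tokens: list[str], verifier_tokens: list[str]) -> list[str]:
    """Accept the longest prefix of the SLM draft that the verifier
    (the larger model's own greedy output) agrees with token-by-token."""
    n = 0
    for d, v in zip(draft_tokens, verifier_tokens):
        if d != v:
            break
        n += 1
    accepted = draft_tokens[:n]
    # On divergence, substitute the verifier's next token for the
    # rejected draft token, so progress is made every round.
    if n < len(verifier_tokens):
        accepted.append(verifier_tokens[n])
    return accepted
```

Here the larger model validates several drafted tokens in one pass instead of generating them one at a time, which is where the latency savings of speculative decoding come from.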
Problem

Research questions and friction points this paper is trying to address.

code completion
latency
accuracy
model cascading
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

model cascading
code completion
speculative decoding
latency-accuracy trade-off
lightweight language model