CoLLM: A Unified Framework for Co-execution of LLMs Federated Fine-tuning and Inference

📅 2026-03-31
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the inefficiency of conventional large language model (LLM) post-training pipelines, which decouple fine-tuning from inference, leading to resource underutilization and delayed quality improvements. To overcome this limitation, the authors propose CoLLM, a novel framework that unifies federated parameter-efficient fine-tuning (FL PEFT) and inference into a single collaborative execution pipeline. CoLLM leverages intra-replica parameter sharing, inter-replica dual-timescale scheduling, and a shadow adapter mechanism on shared edge GPU clusters to enable real-time parameter reuse and dynamic load balancing. The approach jointly optimizes long-term model quality and short-term inference efficiency, achieving up to a 3× improvement in system goodput across diverse LLMs and realistic workloads, significantly outperforming existing LLM serving systems.
📝 Abstract
As Large Language Models (LLMs) are increasingly adopted in edge intelligence to power domain-specific applications and personalized services, the quality and efficiency of the LLM post-training phase, including fine-tuning and inference, have become critical due to constrained resources. Although recent advances in federated parameter-efficient fine-tuning (FL PEFT) and low-latency inference have improved individual task performance, fine-tuning and inference are still handled as isolated workloads, which overlooks their interdependence and results in redundant deployments and delayed improvement in inference quality. To address these limitations, we introduce a new co-execution framework and instantiate it with CoLLM, a system that unifies FL PEFT and inference on shared edge replicas and model parameters. CoLLM addresses key challenges at both replica and cluster levels through: (1) an intra-replica model sharing mechanism that enables real-time model parameter reuse via unmerged inference and shadow adapter strategies; and (2) a two-timescale inter-replica coordination algorithm that adaptively balances fine-tuning and inference workloads to jointly optimize long-term model quality gains and short-term inference efficiency. Extensive evaluation across diverse LLMs and real-world traces shows that CoLLM consistently outperforms state-of-the-art LLM systems, achieving up to 3× higher goodput and demonstrating its effectiveness in enabling seamless LLM post-training for edge intelligence.
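The abstract names "unmerged inference" and a "shadow adapter" but does not spell out their mechanics. The sketch below is one plausible reading, not the authors' implementation: it assumes a LoRA-style adapter (low-rank matrices A and B) kept separate from the frozen base weights W, so inference computes W·x + scale·B·(A·x) instead of folding the adapter into W. It further assumes the shadow adapter is a private copy that fine-tuning updates while inference keeps reading a stable serving copy. All names here (`LoRAAdapter`, `SharedLayer`, `train_step`, `publish`) are hypothetical.

```python
import copy
from dataclasses import dataclass
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


@dataclass
class LoRAAdapter:
    """Low-rank update: A is (r x d_in), B is (d_out x r). Assumed form."""
    A: Matrix
    B: Matrix
    scale: float = 1.0

    def delta(self, x: Vector) -> Vector:
        # Adapter contribution, computed without touching base weights.
        return [self.scale * y for y in matvec(self.B, matvec(self.A, x))]


class SharedLayer:
    """One layer whose frozen base weights serve both workloads (hypothetical)."""

    def __init__(self, W: Matrix, adapter: LoRAAdapter) -> None:
        self.W = W                             # frozen, shared base weights
        self.serving = adapter                 # read by inference
        self.shadow = copy.deepcopy(adapter)   # written by fine-tuning

    def infer(self, x: Vector) -> Vector:
        # Unmerged inference: base output plus adapter output, kept separate.
        base = matvec(self.W, x)
        return [b + d for b, d in zip(base, self.serving.delta(x))]

    def train_step(self, grad_A: Matrix, grad_B: Matrix, lr: float) -> None:
        # Plain SGD on the shadow copy only; the serving copy is unchanged.
        self.shadow.A = [[a - lr * g for a, g in zip(ra, rg)]
                         for ra, rg in zip(self.shadow.A, grad_A)]
        self.shadow.B = [[b - lr * g for b, g in zip(rb, rg)]
                         for rb, rg in zip(self.shadow.B, grad_B)]

    def publish(self) -> None:
        # Make the fine-tuned parameters visible to inference.
        self.serving = copy.deepcopy(self.shadow)


layer = SharedLayer(
    W=[[1.0, 0.0], [0.0, 1.0]],
    adapter=LoRAAdapter(A=[[1.0, 1.0]], B=[[0.0], [0.0]]),  # B starts at zero
)
x = [1.0, 2.0]
before = layer.infer(x)
layer.train_step(grad_A=[[0.0, 0.0]], grad_B=[[-1.0], [0.0]], lr=0.5)
during = layer.infer(x)   # still served from the old adapter
layer.publish()
after = layer.infer(x)    # now reflects the fine-tuned adapter
```

Because the base weights are never modified in this sketch, a single copy of W can back both workloads, and publishing an update only replaces the small adapter. Whether CoLLM publishes by copying, by pointer swap, or by another scheme is not stated in the abstract.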
Problem

Research questions and friction points this paper is trying to address.

LLM serving
fine-tuning and inference co-execution
edge intelligence
model deployment redundancy
inference quality delay
Innovation

Methods, ideas, or system contributions that make the work stand out.

co-execution
federated parameter-efficient fine-tuning
model sharing
two-timescale coordination
SLO-aware LLM serving