🤖 AI Summary
Large language models (LLMs) face significant challenges in efficient training and inference on single-node, especially consumer-grade, hardware due to their massive parameter counts.
Method: This paper proposes a lightweight deployment framework integrating model partitioning, distributed scheduling, and metaheuristic load balancing. Unlike existing LLM serving systems, it is resource-aware, dynamically optimizing computational graph partitioning and inter-device communication overhead, while embedding an enhanced ant colony optimization algorithm for adaptive task allocation across heterogeneous hardware.
Contribution/Results: On a single consumer PC equipped with an RTX 4090 GPU, the framework successfully deploys 7B–13B LLMs. It achieves 18.7%, 23.4%, and 31.2% higher inference throughput compared to vLLM, Text Generation Inference (TGI), and Hugging Face’s Text Generation Inference, respectively, while reducing peak GPU memory consumption by 32%. The approach provides a reproducible technical pathway for democratizing LLM deployment at the edge.
📝 Abstract
Large language models (LLM) are advanced AI systems trained on extensive textual data, leveraging deep learning techniques to understand and generate human-like language. Today's LLMs with billions of parameters are so huge that hardly any single computing node can train, fine-tune, or infer from them. Therefore, several distributed computing techniques are being introduced in the literature to properly utilize LLMs. We have explored the application of distributed computing techniques in LLMs from two angles.
egin{itemize}
item We study the techniques that democratize the LLM, that is, how large models can be run on consumer-grade computers. Here, we also implement a novel metaheuristics-based modification to an existing system.
item We perform a comparative study on three state-of-the-art LLM serving techniques. end{itemize}