AIvailable: A Software-Defined Architecture for LLM-as-a-Service on Heterogeneous and Legacy GPUs

📅 2025-11-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deploying high-performance large language model (LLM) inference services remains challenging in resource-constrained environments such as academic labs and SMEs, especially on heterogeneous, legacy GPU hardware (NVIDIA/AMD). Method: This paper proposes a low-overhead LLM-as-a-Service platform designed for heterogeneous legacy GPUs. It employs a software-defined AI architecture with a novel VRAM-aware dynamic model allocation and rescheduling mechanism that abstracts hardware heterogeneity to enable fully GPU-accelerated, cross-vendor inference without CPU fallbacks. The platform integrates secure request routing, lightweight load balancing, and distributed monitoring. Contribution/Results: To the authors' knowledge, this is the first system enabling highly available, elastic inference of open-source LLMs on mixed legacy GPU clusters, significantly improving VRAM utilization. It offers a unified API supporting multiple models, lowering deployment barriers. Experiments show a 2.3× increase in VRAM utilization over baseline approaches, 99.8% service availability, and a substantial reduction in LLMaaS operational cost.

📝 Abstract
The rise of Large Language Models (LLMs) has increased the need for scalable, high-performance inference systems, yet most existing frameworks assume homogeneous, resource-rich hardware, an assumption that is often unrealistic in academic or otherwise resource-constrained settings. We introduce AIvailable, a low-cost, highly available LLM-as-a-Service (LLMaaS) platform that uses a software-defined approach to run LLMs across heterogeneous and legacy GPU nodes, including NVIDIA and AMD devices, with a focus on fully utilizing each node's VRAM. AIvailable performs fully GPU-accelerated inference without CPU fallbacks and features a unified client interface that allows seamless interaction with all deployed LLMs through a single logical unit. The architecture comprises four main components: the Client Interface for user access, the Service Frontend for secure request routing and load balancing, the SDAI Controller for orchestration, deployment, and monitoring, and the Service Backend of heterogeneous GPU nodes executing workloads. By abstracting GPU-specific details and providing dynamic, VRAM-aware allocation and reallocation of models, AIvailable ensures efficient use of resources and resilience against failures or workload fluctuations. Targeting academic labs, private companies, and other constrained organizations, it supports diverse open LLMs, helping to democratize generative AI through the repurposing of legacy GPUs.
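The VRAM-aware model placement described in the abstract can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the paper's actual algorithm: the node names, model names, sizes, and the best-fit heuristic below are all hypothetical.

```python
# Hypothetical sketch of VRAM-aware model placement across mixed
# NVIDIA/AMD nodes, in the spirit of AIvailable's dynamic allocation.
# Node and model names are illustrative; the real algorithm differs.

def place_models(models, nodes):
    """Greedy best-fit: assign each model (largest first) to the node
    whose free VRAM leaves the least slack, regardless of GPU vendor."""
    free = dict(nodes)          # node name -> free VRAM in GiB
    placement = {}
    for name, need in sorted(models.items(), key=lambda kv: -kv[1]):
        # candidate nodes with enough free VRAM, smallest slack first
        fits = [(free[n] - need, n) for n in free if free[n] >= need]
        if not fits:
            placement[name] = None      # no single GPU can host it fully
            continue
        _, best = min(fits)
        placement[name] = best
        free[best] -= need
    return placement

nodes = {"nvidia-gtx1080ti": 11.0, "amd-rx580": 8.0, "nvidia-p40": 24.0}
models = {"llama3-8b-q4": 6.0, "mistral-7b-q4": 5.0, "phi3-mini": 3.0}
print(place_models(models, nodes))
```

Under this heuristic the 6 GiB model lands on the 8 GiB AMD card (tightest fit), keeping the larger NVIDIA cards free for subsequent models; packing tightly is what raises aggregate VRAM utilization.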
Problem

Research questions and friction points this paper aims to address.

Enables LLM deployment on heterogeneous and legacy GPU hardware
Optimizes VRAM utilization across mixed NVIDIA and AMD devices
Provides resilient resource management for constrained academic and private settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Software-defined LLM service on heterogeneous GPUs
Unified client interface for all deployed language models
Dynamic VRAM-aware allocation across legacy GPU nodes
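The dynamic reallocation idea in the last bullet can be sketched as a simple failover routine. The function, node names, and first-fit policy below are hypothetical illustrations, not the mechanism implemented in AIvailable.

```python
# Hypothetical failover sketch: when a GPU node drops out, re-place its
# models on surviving nodes with spare VRAM. Names are illustrative only.

def reschedule(placement, model_sizes, free_vram, failed):
    """Move models off `failed`; first-fit onto the surviving node with
    the least (but sufficient) free VRAM. Returns (new placement, dropped)."""
    moved, dropped = {}, []
    for model, node in placement.items():
        if node != failed:
            continue
        target = next((n for n, f in sorted(free_vram.items(),
                                            key=lambda kv: kv[1])
                       if n != failed and f >= model_sizes[model]), None)
        if target is None:
            dropped.append(model)       # nothing can host it right now
        else:
            moved[model] = target
            free_vram[target] -= model_sizes[model]
    new_placement = {m: moved.get(m, None if n == failed else n)
                     for m, n in placement.items()}
    return new_placement, dropped

free = {"nodeB": 4.0, "nodeC": 8.0}
new_placement, dropped = reschedule(
    {"llama3": "nodeA", "phi3": "nodeB"},
    {"llama3": 6.0, "phi3": 3.0}, free, failed="nodeA")
print(new_placement, dropped)
```

A model that fits nowhere is dropped rather than spilled to CPU, matching the paper's stated CPU-fallback-free design goal.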
Pedro Antunes
LASIGE, Faculty of Sciences, University of Lisbon
Information Systems, Design Science, Business Process Management, Storytelling, BizDevOps
Ana Rita Ortigoso
Computer Science and Communication Research Centre, Polytechnic University of Leiria, Portugal
Gabriel Vieira
Computer Science and Communication Research Centre, Polytechnic University of Leiria, Portugal
Daniel Fuentes
Computer Science and Communication Research Centre, Polytechnic University of Leiria, Portugal
Luís Frazão
Computer Science and Communication Research Centre, Polytechnic University of Leiria, Portugal
Nuno Costa
Computer Science and Communication Research Centre, Polytechnic University of Leiria, Portugal
António Pereira
Computer Science and Communication Research Centre, Polytechnic University of Leiria, Portugal