FIRST: Federated Inference Resource Scheduling Toolkit for Scientific AI Model Access

📅 2025-10-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
High-performance computing (HPC) environments lack secure, scalable, and distributed AI inference services tailored for scientific computing. Method: This paper introduces the first federated large language model (LLM) inference platform designed specifically for HPC. It features a cluster-agnostic, OpenAI-compatible API supporting multiple backends (e.g., vLLM); integrates Globus Auth/Compute for cross-domain identity management and function-level scheduling; and proposes a novel "hot-node" retention mechanism with automated scaling to jointly optimize low-latency interactive inference and high-throughput batch processing. Contribution/Results: Deployed on production HPC systems, the platform generates over one billion tokens daily without reliance on commercial cloud infrastructure. It is the first to deliver cloud-like AI inference for scientific LLMs on HPC, with enhanced security, broad accessibility, and improved resource utilization.

📝 Abstract
We present the Federated Inference Resource Scheduling Toolkit (FIRST), a framework enabling Inference-as-a-Service across distributed High-Performance Computing (HPC) clusters. FIRST provides cloud-like access to diverse AI models, like Large Language Models (LLMs), on existing HPC infrastructure. Leveraging Globus Auth and Globus Compute, the system allows researchers to run parallel inference workloads via an OpenAI-compliant API on private, secure environments. This cluster-agnostic API allows requests to be distributed across federated clusters, targeting numerous hosted models. FIRST supports multiple inference backends (e.g., vLLM), auto-scales resources, maintains "hot" nodes for low-latency execution, and offers both high-throughput batch and interactive modes. The framework addresses the growing demand for private, secure, and scalable AI inference in scientific workflows, allowing researchers to generate billions of tokens daily on-premises without relying on commercial cloud infrastructure.
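Because the API is OpenAI-compliant, standard client tooling can target a FIRST gateway directly. A minimal sketch of the request shape, assuming a hypothetical gateway URL and model name (neither is given in the abstract; an actual deployment would supply its own endpoint, hosted models, and Globus-issued credentials):

```python
import json

# Hypothetical FIRST gateway -- illustrative only; the real endpoint,
# model names, and auth token depend on the specific HPC deployment.
BASE_URL = "https://first-gateway.example-hpc.org/v1"

def build_chat_request(model, prompt, stream=False):
    """Build an OpenAI-compatible /chat/completions request body.

    Since the API follows the OpenAI schema, any standard client
    (e.g. the `openai` Python package configured with base_url=BASE_URL)
    could send this same body to a hosted model on a federated cluster.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Interactive mode would typically stream tokens; batch mode
        # would submit many non-streaming requests for high throughput.
        "stream": stream,
    }

payload = build_chat_request(
    "example-org/science-llm-70b",  # hypothetical hosted model name
    "Summarize the key findings of this simulation log.",
)
print(json.dumps(payload, indent=2))
```

The payload is deliberately built as plain data here rather than sent over the network, to show only the schema the cluster-agnostic API accepts.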
Problem

Research questions and friction points this paper is trying to address.

Enabling private AI inference across distributed HPC clusters
Providing scalable access to diverse AI models on-premises
Supporting parallel inference workloads via federated scheduling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Federated toolkit for distributed AI model inference
OpenAI-compliant API enabling parallel inference workloads
Auto-scaling with hot nodes for low-latency execution
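The page does not spell out the retention policy behind the hot-node mechanism. Purely as an illustration of the idea, an auto-scaler might compare queued requests against warm capacity, releasing idle hot nodes while always retaining a minimum so interactive requests avoid cold-start model loads (all thresholds below are assumed, not taken from the paper):

```python
def scale_decision(queued_requests, hot_nodes, idle_hot_nodes,
                   per_node_capacity=8, min_hot=1, max_nodes=16):
    """Toy auto-scaling rule for a "hot" node pool (illustrative only).

    Hot nodes keep model weights resident in GPU memory, so requests
    routed to them skip the expensive model-load step. Idle hot nodes
    above a retained minimum are released back to the cluster scheduler.
    Returns an (action, node_count) pair.
    """
    # Ceiling division: nodes needed to serve the current queue.
    needed = -(-queued_requests // per_node_capacity)
    if needed > hot_nodes:
        # Demand exceeds warm capacity: acquire more nodes (bounded).
        return ("scale_up", min(needed, max_nodes) - hot_nodes)
    if idle_hot_nodes > 0 and hot_nodes > min_hot:
        # Release idle capacity, but never drop below the retained minimum.
        return ("scale_down", min(idle_hot_nodes, hot_nodes - min_hot))
    return ("hold", 0)

# A burst of 20 queued requests against 2 warm nodes triggers growth:
print(scale_decision(queued_requests=20, hot_nodes=2, idle_hot_nodes=0))
# An empty queue with 3 idle nodes shrinks the pool, keeping 1 warm:
print(scale_decision(queued_requests=0, hot_nodes=3, idle_hot_nodes=3))
```

The design point this sketch captures is the trade-off the bullets describe: retained hot nodes buy low latency for interactive use, while scale-up/scale-down keeps overall cluster utilization high for batch work.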