The infrastructure powering IBM's Gen AI model development

📅 2024-07-07
🏛️ arXiv.org
📈 Citations: 3 · Influential: 1
🤖 AI Summary
To address the escalating computational demands, high costs, and cross-environment coordination challenges of large language model (LLM) training, IBM built an end-to-end hybrid-cloud AI infrastructure comprising Vela, a cloud-based, multi-tenant, AI-optimized supercomputing platform integrated into IBM Cloud, and Blue Vela, a purpose-built on-premises environment for its largest and most ambitious training runs. The infrastructure combines AI-optimized hardware, a co-designed software stack, and holistic telemetry to serve multiple types of AI workloads, including training jobs in which thousands of GPUs must cooperate. Together, Vela and Blue Vela have accelerated the iterative development of IBM's generative AI models while preserving the flexibility to adapt to an evolving commercial and model landscape.

📝 Abstract
AI Infrastructure plays a key role in the speed and cost-competitiveness of developing and deploying advanced AI models. The current demand for powerful AI infrastructure for model training is driven by the emergence of generative AI and foundational models, where on occasion thousands of GPUs must cooperate on a single training job for the model to be trained in a reasonable time. Delivering efficient and high-performing AI training requires an end-to-end solution that combines hardware, software and holistic telemetry to cater for multiple types of AI workloads. In this report, we describe IBM's hybrid cloud infrastructure that powers our generative AI model development. This infrastructure includes (1) Vela: an AI-optimized supercomputing capability directly integrated into the IBM Cloud, delivering scalable, dynamic, multi-tenant and geographically distributed infrastructure for large-scale model training and other AI workflow steps and (2) Blue Vela: a large-scale, purpose-built, on-premises hosting environment that is optimized to support our largest and most ambitious AI model training tasks. Vela provides IBM with the dual benefit of high performance for internal use along with the flexibility to adapt to an evolving commercial landscape. Blue Vela provides us with the benefits of rapid development of our largest and most ambitious models, as well as future-proofing against the evolving model landscape in the industry. Taken together, they provide IBM with the ability to rapidly innovate in the development of both AI models and commercial offerings.
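The abstract's description of Vela as a dynamic, multi-tenant platform can be illustrated with a toy scheduling sketch. This is purely illustrative and assumes nothing about IBM's actual scheduler: the function name `allocate` and the first-fit policy are invented for the example.

```python
def allocate(free_gpus, jobs):
    """First-fit allocation of a shared GPU pool across tenant jobs.

    free_gpus: number of GPUs currently idle in the cluster.
    jobs: list of (job_id, gpus_requested) pairs, in priority order.
    Returns (placements, remaining): placements maps job_id -> GPUs granted;
    jobs whose full request cannot be met are skipped, not partially placed.
    """
    placements = {}
    for job_id, requested in jobs:
        if requested <= free_gpus:
            placements[job_id] = requested
            free_gpus -= requested
    return placements, free_gpus


# Example: an 8-GPU pool; the 8-GPU job "b" cannot fit after "a" is placed.
placements, remaining = allocate(8, [("a", 4), ("b", 8), ("c", 2)])
print(placements, remaining)  # {'a': 4, 'c': 2} 2
```

A real multi-tenant scheduler would additionally handle preemption, topology-aware placement, and elasticity, which is part of what makes infrastructure like Vela non-trivial to build.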
Problem
Research questions and friction points this paper addresses: AI model development, computational resources, training cost.

Innovation
Methods, ideas, or system contributions that make the work stand out: Vela, Blue Vela, generative AI.
Authors: Talia Gershon, Seetharami R. Seelam, I-Hsin Chung, Apoorve Mohan, Ming-Hung Chen, Lixiang Luo, Robert Walkup, Constantinos Evangelinos, Shweta Salaria, Yoonho Park, L. Schour, Alim Alim, Pavlos Maniotis, L. Schares, Bengi Karacali-Akyamac, Sophia Wen, Tatsuhiro Chiba, Sunyanan Choochotkaew, Takeshi Yoshimura, C. Misale, Tonia Elengikal, Kevin O Connor, Zhuoran Liu, L. Schneidenbach, James Caden, Christopher Laibinis, Carlos Fonseca, Vasily Tarasov, S. Sundararaman, Frank Schmuck, S. Guthridge, Marc Eshel, Runyu Liu, W. Pointer, D. Wyskida, Bob Krull, Brent Wolfe, William Cornejo, John Walter, Colm Malone, Clifford Perucci, Frank Franco, Bob Calio, R. Kilduff, John Kienle, Matthew Connolly, Edie Fost, Gina Roma, Jake Fonseca, Ido Levy, Michele Payne, Ryan Schenkel, Amir Malki, Lion Schneider, Aniruddha Narkhede, Alexandra Kisin, Olga Dodin, Bill Rippon, John M. Ganci, Rakesh Pandey, Aditya Gidh, Shubham Sharma, Mayank Mishra, Rameswar Panda, Aditya Prasad, Matt Stallone, Gaoyuan Zhang, Yikang Shen, David Cox, Ruchir Puri, Dakshi Agrawal, Drew Thorstensen, Brent Tang, Saurabh Kumar Gupta, Amitabha Biswas, Jason Van Patten, Matthew Runion, Sai Kaki, Steve Pritko, Shahan Najam, Surya Nambala, Radhika Chirra, Rick Welp, Felipe Telles, A. Arvelo, King Chu, E. J. Seminaro, Felix Eickhoff, William Hanson, Piyush Chaudhary, Piyush Shivam, Wesley Jones, Chris Bostic, Wayne Sawdon, John Lewars, Michael Spriggs, George Gao, Ashish Kamra, Gaurav Singh, Tushar Katarki, Joe Talerico, Zenghui Shi, Sai Sindhur Malleni, Erwan Gallen