HiPerRAG: High-Performance Retrieval Augmented Generation for Scientific Insights

📅 2025-05-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
The rapid growth of scientific literature impedes efficient knowledge discovery, leads to duplicated research, and limits cross-disciplinary collaboration. To address these challenges, we propose HiPerRAG, a high-performance Retrieval-Augmented Generation (RAG) workflow that indexes and retrieves knowledge from more than 3.6 million scientific articles. The workflow is driven by High-Performance Computing (HPC) and scales to thousands of GPUs on the Polaris, Sunspot, and Frontier supercomputers. We further propose Oreo, a high-throughput multimodal document parsing model aimed at complex scientific documents containing mathematical formulas and figures. Additionally, we design ColTrast, a query-aware encoder fine-tuning algorithm that combines contrastive learning with late-interaction techniques to improve retrieval accuracy. Our system achieves 90% accuracy on SciQ and 76% on PubMedQA, outperforming both PubMedGPT and GPT-4.

📝 Abstract
The volume of scientific literature is growing exponentially, leading to underutilized discoveries, duplicated efforts, and limited cross-disciplinary collaboration. Retrieval Augmented Generation (RAG) offers a way to assist scientists by improving the factuality of Large Language Models (LLMs) in processing this influx of information. However, scaling RAG to handle millions of articles introduces significant challenges, including the high computational costs associated with parsing documents and embedding scientific knowledge, as well as the algorithmic complexity of aligning these representations with the nuanced semantics of scientific content. To address these issues, we introduce HiPerRAG, a RAG workflow powered by high-performance computing (HPC) to index and retrieve knowledge from more than 3.6 million scientific articles. At its core are Oreo, a high-throughput model for multimodal document parsing, and ColTrast, a query-aware encoder fine-tuning algorithm that enhances retrieval accuracy by using contrastive learning and late-interaction techniques. HiPerRAG delivers robust performance on existing scientific question answering benchmarks and two new benchmarks introduced in this work, achieving 90% accuracy on SciQ and 76% on PubMedQA, outperforming both domain-specific models like PubMedGPT and commercial LLMs such as GPT-4. Scaling to thousands of GPUs on the Polaris, Sunspot, and Frontier supercomputers, HiPerRAG delivers million-document-scale RAG workflows for unifying scientific knowledge and fostering interdisciplinary innovation.
Problem

Research questions and friction points this paper is trying to address.

Exponential growth of scientific literature causes underutilized discoveries and duplicated efforts.
Scaling RAG for millions of articles faces high computational costs and algorithmic complexity.
Aligning document representations with scientific semantics is challenging for retrieval accuracy.
Innovation

Methods, ideas, or system contributions that make the work stand out.

High-performance computing for large-scale RAG workflows
Oreo model for high-throughput multimodal document parsing
ColTrast algorithm for query-aware contrastive learning
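
The abstract describes ColTrast only as combining contrastive learning with late-interaction techniques. As a rough illustration of those two generic ideas, and not the authors' implementation, the sketch below scores a query against documents with a ColBERT-style "MaxSim" late-interaction score and then applies a softmax (InfoNCE-style) contrastive loss. The function names, the toy 2-dimensional token vectors, and the `temperature` parameter are all assumptions made for this example.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_tokens, doc_tokens):
    """Late-interaction score: each query token vector is matched to its
    most similar document token vector, and the best matches are summed."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

def contrastive_loss(query_tokens, docs, positive_index, temperature=1.0):
    """InfoNCE-style loss: negative log-probability of the positive document
    under a softmax over the late-interaction scores of all candidates."""
    scores = [maxsim_score(query_tokens, d) / temperature for d in docs]
    top = max(scores)  # subtract the max for numerical stability
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_norm - scores[positive_index]

# Toy token embeddings (made-up values, purely illustrative).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_relevant = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
doc_unrelated = [[-1.0, 0.0], [0.0, -1.0]]
docs = [doc_relevant, doc_unrelated]

# The loss is small when the labelled positive is the well-matched document
# and large when the labelled positive is the poorly matched one.
loss_correct = contrastive_loss(query, docs, positive_index=0)
loss_wrong = contrastive_loss(query, docs, positive_index=1)
print(loss_correct < loss_wrong)
```

In a real encoder fine-tuning setup the token vectors would come from a trainable model and the loss would be minimized by gradient descent; here they are fixed so the scoring and loss can be checked by hand.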
Authors
Ozan Gokdemir
Argonne National Laboratory, Lemont, Illinois, USA; The University of Chicago, Chicago, Illinois, USA
Carlo Siebenschuh
Argonne National Laboratory, Lemont, Illinois, USA; The University of Chicago, Chicago, Illinois, USA
Alexander Brace
Argonne National Laboratory, Lemont, Illinois, USA; The University of Chicago, Chicago, Illinois, USA
Azton Wells
Argonne National Laboratory, Lemont, Illinois, USA
Brian Hsu
Associate Professor, University of North Carolina at Chapel Hill
linguistics, syntax, phonology
Kyle Hippe
Argonne National Laboratory, Lemont, Illinois, USA; The University of Chicago, Chicago, Illinois, USA
Priyanka V. Setty
Argonne National Laboratory, Lemont, Illinois, USA; The University of Chicago, Chicago, Illinois, USA
Aswathy Ajith
University of Chicago
NLP, Information Extraction
J. Gregory Pauloski
NVIDIA (formerly at ANL and UChicago)
Computer Science, HPC, Distributed Computing, Machine Learning, Systems
Varuni Sastry
Argonne National Laboratory, Lemont, Illinois, USA
Sam Foreman
Argonne National Laboratory, Lemont, Illinois, USA
Huihuo Zheng
Computer Scientist, Argonne National Laboratory
High Performance Computing, Machine Learning, I/O, Data Management
Heng Ma
Argonne National Laboratory
Physical Chemistry, Biophysics, Machine Learning
Bharat Kale
Argonne National Laboratory, Lemont, Illinois, USA
Nicholas Chia
Data Science and Learning
Large Language Models, Reinforcement Learning, Complex Systems, Microbiome
Thomas Gibbs
NVIDIA Inc., Santa Clara, California, USA
Michael E. Papka
University of Illinois Chicago / Argonne National Laboratory / University of Chicago
visualization, analysis, high performance computing
Thomas Brettin
Argonne National Laboratory and University of Chicago
Computational Genomics
Francis J. Alexander
Argonne National Laboratory, Lemont, Illinois, USA
Anima Anandkumar
California Institute of Technology and NVIDIA
Machine Learning and Artificial Intelligence
Ian Foster
Argonne National Laboratory, Lemont, Illinois, USA; The University of Chicago, Chicago, Illinois, USA
Rick Stevens
Professor of Computer Science, University of Chicago
HPC, Bioinformatics, Distributed Computing, Visualization, Collaboration
Venkatram Vishwanath
Computer Scientist, Argonne National Laboratory
High Performance Computing, Data Intensive Computing, Computer Networks, Computer Architecture, Machine Learning
Arvind Ramanathan
Argonne National Laboratory
Machine Learning, Computational Biology, Molecular biophysics, enzyme catalysis, higher-order statistics