Getting Your Indices in a Row: Full-Text Search for LLM Training Data for the Real World

📅 2025-10-10
🤖 AI Summary
Large language model (LLM) training datasets are prohibitively large, largely inaccessible, and lack auditability, hindering transparency, safety evaluation, and reproducible research. Method: This work introduces the first publicly available offline indexing system for ultra-large-scale LLM training data. Built upon the Apertus dataset, it deploys and optimizes Elasticsearch on the ARM64-based green supercomputer Alps—marking the first such deployment on ARM64—and constructs a high-performance full-text index over 8.6 trillion tokens (56.6% of the 15.2-trillion-token corpus). Contribution/Results: The system serves dual purposes—as an open web search engine and an LLM safety auditing tool—enabling fine-grained, verifiable data provenance tracing and content inspection without requiring model jailbreaking. Key innovations include an ARM64-optimized distributed indexing architecture, an energy-efficient retrieval framework for trillion-token unstructured text, and the first publicly accessible, reproducible, and auditable infrastructure for large-scale LLM training data.

📝 Abstract
The performance of Large Language Models (LLMs) is determined by their training data. Despite the proliferation of open-weight LLMs, access to LLM training data has remained limited. Even for fully open LLMs, the scale of the data makes it all but inscrutable to the general scientific community, despite potentially containing critical data scraped from the internet. In this paper, we present the full-text indexing pipeline for the Apertus LLM training data. Leveraging Elasticsearch parallel indices and the Alps infrastructure, a state-of-the-art, highly energy-efficient arm64 supercluster, we were able to index 8.6T tokens out of 15.2T used to train the Apertus LLM family, creating both a critical LLM safety tool and effectively an offline, curated, open web search engine. Our contribution is threefold. First, we demonstrate that Elasticsearch can be successfully ported onto next-generation arm64-based infrastructure. Second, we demonstrate that full-text indexing at the scale of modern LLM training datasets and the entire open web is feasible and accessible. Finally, we demonstrate that such indices can be used to ensure previously inaccessible jailbreak-agnostic LLM safety. We hope that our findings will be useful to other teams attempting large-scale data indexing and facilitate the general transition towards greener computation.
Problem

Research questions and friction points this paper is trying to address.

Indexing massive LLM training data for searchable access
Enabling full-text search across web-scale datasets efficiently
Creating safety tools through searchable training data indices
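An index like this turns safety auditing into plain retrieval: to check whether a given string appears in the training data, one queries the full-text index directly instead of probing the model. A minimal sketch of such an audit query in Elasticsearch's query DSL follows; the field names (`text`, `url`), index pattern, and result size are illustrative assumptions, not the paper's actual schema:

```python
# Hypothetical sketch: build an Elasticsearch query-DSL body that checks
# whether a verbatim snippet occurs in the indexed training corpus.
# Field names ("text", "url") and the index pattern are assumptions.
def provenance_query(snippet: str, size: int = 10) -> dict:
    """Return a query body matching documents containing `snippet` verbatim."""
    return {
        "query": {"match_phrase": {"text": snippet}},  # exact phrase match
        "_source": ["url", "text"],  # return provenance fields only
        "size": size,
    }

# The body would be sent to a search endpoint, e.g. (hypothetical index name):
#   es.search(index="apertus-corpus-*", body=provenance_query("some exact string"))
body = provenance_query("some exact string")
```

Because the query runs against the index rather than the model, the audit is jailbreak-agnostic: no prompt engineering is needed to verify what the model was trained on.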
Innovation

Methods, ideas, or system contributions that make the work stand out.

Full-text indexing pipeline using Elasticsearch parallel indices
Leveraging Alps infrastructure for energy-efficient arm64 supercluster
Indexing 8.6T tokens to create LLM safety tool
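The parallel-indices idea above can be sketched as deterministic routing: each document is hashed to one of several indices so that bulk indexing can proceed concurrently across them. This is a minimal illustration under assumed names and shard counts, not the paper's actual pipeline:

```python
# Hypothetical sketch: route documents to parallel Elasticsearch-style
# indices by hashing document IDs, as one might when bulk-indexing a
# trillion-token corpus. Index names and index count are assumptions.
import hashlib

N_PARALLEL_INDICES = 8  # assumption: number of parallel indices


def route_index(doc_id: str, n_indices: int = N_PARALLEL_INDICES) -> str:
    """Deterministically map a document ID to one of n parallel indices."""
    h = int(hashlib.sha256(doc_id.encode("utf-8")).hexdigest(), 16)
    return f"apertus-corpus-{h % n_indices:03d}"  # hypothetical naming scheme


def build_bulk_actions(docs):
    """Yield action/source line pairs in the Elasticsearch bulk-API shape."""
    for doc in docs:
        yield {"index": {"_index": route_index(doc["id"]), "_id": doc["id"]}}
        yield {"text": doc["text"]}


docs = [{"id": f"doc-{i}", "text": "..."} for i in range(4)]
actions = list(build_bulk_actions(docs))
```

Deterministic hashing keeps the routing stateless, so independent indexing workers can process disjoint slices of the corpus without coordinating, which is what makes scaling to thousands of nodes tractable.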
Ines Altemir Marinas
IC, EPFL, Lausanne, Switzerland
Anastasiia Kucherenko
IEM, HES-SO Valais-Wallis, Sierre, Switzerland
Alexander Sternfeld
IEM, HES-SO Valais-Wallis, Sierre, Switzerland
Andrei Kucharavy
Assistant Professor, HES-SO Valais-Wallis
Machine Learning · Evolution · Distributed Computation · Computational Biology