Luxical: High-Speed Lexical-Dense Text Embeddings

📅 2025-12-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the trade-off between speed and flexibility in web-scale text organization, this paper proposes a lightweight “lexical-dense” embedding paradigm. Leveraging knowledge distillation, it transfers semantic capabilities from large language models into a compact architecture that jointly encodes TF-IDF–based sparse lexical features and dense representations via a small ReLU network. The resulting embeddings retain the versatility of dense vectors—supporting retrieval, clustering, classification, and data cleaning—while achieving inference speeds comparable to FastText. Experiments demonstrate 3×–100× higher throughput than neural baselines on document retrieval and LLM data cleaning tasks, with quality matching state-of-the-art embedding models. This work is the first to systematically bridge the efficiency–expressiveness gap between traditional lexical models and Transformer-based embeddings, establishing a scalable, high-performance paradigm for large-scale text preprocessing.

📝 Abstract
Frontier language model quality increasingly hinges on our ability to organize web-scale text corpora for training. Today's dominant tools trade off speed and flexibility: lexical classifiers (e.g., FastText) are fast but limited to producing classification output scores, while the vector-valued outputs of transformer text embedding models flexibly support numerous workflows (e.g., clustering, classification, and retrieval) but are computationally expensive to produce. We introduce Luxical, a library for high-speed "lexical-dense" text embeddings that aims to recover the best properties of both approaches for web-scale text organization. Luxical combines sparse TF-IDF features, a small ReLU network, and a knowledge distillation training regimen to approximate large transformer embedding models at a fraction of their operational cost. In this technical report, we describe the Luxical architecture and training objective and evaluate a concrete Luxical model in two disparate applications: a targeted webcrawl document retrieval test and an end-to-end language model data curation task grounded in text classification. In these tasks we demonstrate speedups ranging from 3x to 100x over varying-sized neural baselines, and throughput comparable to FastText model inference during the data curation task. On these evaluations, the tested Luxical model illustrates favorable compute/quality trade-offs for large-scale text organization, matching the quality of neural baselines. Luxical is available as open-source software at https://github.com/datologyai/luxical.
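The abstract's core recipe (sparse TF-IDF-style features fed through a small ReLU network to produce a dense, unit-normalized embedding) can be sketched roughly as below. All names, dimensions, and the hashed-vocabulary trick are illustrative assumptions, not Luxical's actual API or architecture; see the GitHub repository for the real implementation.

```python
# Hypothetical sketch of a "lexical-dense" embedder: a sparse lexical
# feature vector passed through a small ReLU network, then L2-normalized.
# Dimensions, hashing, and weights here are illustrative only.
import numpy as np

VOCAB = 2**16  # hashed vocabulary size (assumed)
DIM = 256      # output embedding dimension (assumed)

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.02, size=(VOCAB, 128))  # sparse-input layer
W2 = rng.normal(scale=0.02, size=(128, DIM))    # projection to dense space

def tf_features(text):
    """Toy hashed term-frequency vector; a real system would apply IDF weights."""
    x = np.zeros(VOCAB)
    for tok in text.lower().split():
        x[hash(tok) % VOCAB] += 1.0
    if x.sum() > 0:
        x /= x.sum()  # term-frequency normalization
    return x

def embed(text):
    h = np.maximum(tf_features(text) @ W1, 0.0)  # only nonzero rows of W1 matter
    z = h @ W2
    return z / np.linalg.norm(z)  # unit-normalized dense embedding

sim = float(embed("web scale text curation") @ embed("curation of web scale text"))
print(sim)  # cosine similarity between two related texts
```

Because the input vector is sparse, the first matrix multiply only touches the weight rows for tokens actually present in the document, which is where the FastText-like speed comes from.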
Problem

Research questions and friction points this paper is trying to address.

Bridging speed and flexibility in text embedding models.
Reducing computational cost of transformer-based embeddings.
Enhancing web-scale text organization efficiency.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines sparse TF-IDF features with a small ReLU network.
Uses knowledge distillation to approximate transformer embeddings.
Achieves speedups from 3x to 100x over neural baselines.
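The distillation idea above can be sketched as a loss that pulls the fast student's embeddings toward a frozen transformer teacher's. The cosine-distance form below is an assumption for illustration; the report's actual training objective may differ.

```python
# Illustrative knowledge-distillation objective: mean cosine distance
# between student and frozen teacher embeddings (batch x dim).
# The exact loss used by Luxical is not assumed here.
import numpy as np

def distill_loss(student_emb, teacher_emb):
    """Mean cosine distance between row-aligned student/teacher embeddings."""
    s = student_emb / np.linalg.norm(student_emb, axis=1, keepdims=True)
    t = teacher_emb / np.linalg.norm(teacher_emb, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))

rng = np.random.default_rng(1)
teacher = rng.normal(size=(4, 64))
assert abs(distill_loss(teacher, teacher)) < 1e-9  # perfect agreement: zero loss
print(distill_loss(rng.normal(size=(4, 64)), teacher))  # loss for untrained student
```

Minimizing such a loss over a large corpus lets the small network inherit the teacher's embedding geometry without paying transformer inference costs at deployment time.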
Authors
Luke Merrick (DatologyAI)
Alex Fang (DatologyAI)
Aldo Carranza (DatologyAI)
Alvin Deng (DatologyAI)
Amro Abbas (DatologyAI)
Brett Larsen (DatologyAI)
Cody Blakeney (DatologyAI)
Darren Teh (DatologyAI)
David Schwab (DatologyAI)
Fan Pan (DatologyAI)
Haakon Mongstad (DatologyAI)
Haoli Yin (DatologyAI)
Jack Urbanek (DatologyAI)
Jason Lee (DatologyAI)
Jason Telanoff (DatologyAI)
Josh Wills (DatologyAI)
Kaleigh Mentzer (DatologyAI)
Paul Burstein (DatologyAI)
Parth Doshi (MS in CSE, University of California San Diego)
Pratyush Maini (Carnegie Mellon University)
Ricardo Monti (DatologyAI)
Rishabh Adiga (DatologyAI)