Luxical: High-Speed Lexical-Dense Text Embeddings

📅 2025-12-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the trade-off between speed and flexibility in web-scale text organization, this paper proposes a lightweight “lexical-dense” embedding paradigm. Leveraging knowledge distillation, it transfers semantic capabilities from large language models into a compact architecture that jointly encodes TF-IDF–based sparse lexical features and dense representations via a small ReLU network. The resulting embeddings retain the versatility of dense vectors—supporting retrieval, clustering, classification, and data cleaning—while achieving inference speeds comparable to FastText. Experiments demonstrate 3×–100× higher throughput than neural baselines on document retrieval and LLM data cleaning tasks, with quality matching state-of-the-art embedding models. This work is the first to systematically bridge the efficiency–expressiveness gap between traditional lexical models and Transformer-based embeddings, establishing a scalable, high-performance paradigm for large-scale text preprocessing.

📝 Abstract
Frontier language model quality increasingly hinges on our ability to organize web-scale text corpora for training. Today's dominant tools trade off speed and flexibility: lexical classifiers (e.g., FastText) are fast but limited to producing classification output scores, while the vector-valued outputs of transformer text embedding models flexibly support numerous workflows (e.g., clustering, classification, and retrieval) but are computationally expensive to produce. We introduce Luxical, a library for high-speed "lexical-dense" text embeddings that aims to recover the best properties of both approaches for web-scale text organization. Luxical combines sparse TF-IDF features, a small ReLU network, and a knowledge distillation training regimen to approximate large transformer embedding models at a fraction of their operational cost. In this technical report, we describe the Luxical architecture and training objective and evaluate a concrete Luxical model in two disparate applications: a targeted webcrawl document retrieval test and an end-to-end language model data curation task grounded in text classification. In these tasks we demonstrate speedups ranging from 3x to 100x over varying-sized neural baselines, and throughput comparable to FastText model inference during the data curation task. On these evaluations, the tested Luxical model illustrates favorable compute/quality trade-offs for large-scale text organization, matching the quality of neural baselines. Luxical is available as open-source software at https://github.com/datologyai/luxical.
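The abstract's core recipe (sparse TF-IDF-style features fed through a small ReLU network to produce a dense, unit-normalized embedding) can be sketched roughly as below. All names, dimensions, and the hashed-vocabulary trick are illustrative assumptions, not Luxical's actual API or architecture; see the GitHub repository for the real implementation.

```python
# Hypothetical sketch of a "lexical-dense" embedder: a sparse lexical
# feature vector passed through a small ReLU network, then L2-normalized.
# Dimensions, hashing, and weights here are illustrative only.
import numpy as np

VOCAB = 2**16  # hashed vocabulary size (assumed)
DIM = 256      # output embedding dimension (assumed)

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.02, size=(VOCAB, 128))  # sparse-input layer
W2 = rng.normal(scale=0.02, size=(128, DIM))    # projection to dense space

def tf_features(text):
    """Toy hashed term-frequency vector; a real system would apply IDF weights."""
    x = np.zeros(VOCAB)
    for tok in text.lower().split():
        x[hash(tok) % VOCAB] += 1.0
    if x.sum() > 0:
        x /= x.sum()  # term-frequency normalization
    return x

def embed(text):
    h = np.maximum(tf_features(text) @ W1, 0.0)  # only nonzero rows of W1 matter
    z = h @ W2
    return z / np.linalg.norm(z)  # unit-normalized dense embedding

sim = float(embed("web scale text curation") @ embed("curation of web scale text"))
print(sim)  # cosine similarity between two related texts
```

Because the input vector is sparse, the first matrix multiply only touches the weight rows for tokens actually present in the document, which is where the FastText-like speed comes from.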
Problem

Research questions and friction points this paper is trying to address.

Bridging speed and flexibility in text embedding models.
Reducing computational cost of transformer-based embeddings.
Enhancing web-scale text organization efficiency.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines sparse TF-IDF features with a small ReLU network.
Uses knowledge distillation to approximate transformer embeddings.
Achieves speedups from 3x to 100x over neural baselines.
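The distillation idea above can be sketched as a loss that pulls the fast student's embeddings toward a frozen transformer teacher's. The cosine-distance form below is an assumption for illustration; the report's actual training objective may differ.

```python
# Illustrative knowledge-distillation objective: mean cosine distance
# between student and frozen teacher embeddings (batch x dim).
# The exact loss used by Luxical is not assumed here.
import numpy as np

def distill_loss(student_emb, teacher_emb):
    """Mean cosine distance between row-aligned student/teacher embeddings."""
    s = student_emb / np.linalg.norm(student_emb, axis=1, keepdims=True)
    t = teacher_emb / np.linalg.norm(teacher_emb, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))

rng = np.random.default_rng(1)
teacher = rng.normal(size=(4, 64))
assert abs(distill_loss(teacher, teacher)) < 1e-9  # perfect agreement: zero loss
print(distill_loss(rng.normal(size=(4, 64)), teacher))  # loss for untrained student
```

Minimizing such a loss over a large corpus lets the small network inherit the teacher's embedding geometry without paying transformer inference costs at deployment time.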
Authors
Luke Merrick (DatologyAI)
Alex Fang (DatologyAI)
Aldo Carranza (DatologyAI)
Alvin Deng (DatologyAI)
Amro Abbas (DatologyAI)
Brett Larsen (DatologyAI)
Cody Blakeney (DatologyAI)
Darren Teh (DatologyAI)
David Schwab (DatologyAI)
Fan Pan (DatologyAI)
Haakon Mongstad (DatologyAI)
Haoli Yin (DatologyAI)
Jack Urbanek (DatologyAI)
Jason Lee (DatologyAI)
Jason Telanoff (DatologyAI)
Josh Wills (DatologyAI)
Kaleigh Mentzer (DatologyAI)
Paul Burstein (DatologyAI)
Parth Doshi (MS in CSE, University of California San Diego)
Pratyush Maini (Carnegie Mellon University)
Ricardo Monti (DatologyAI)
Rishabh Adiga (DatologyAI)