AI Summary
Traditional feature engineering for high-dimensional numerical regression relies heavily on domain expertise and suffers from poor generalizability. Method: This paper systematically investigates the effectiveness and underlying mechanisms of large language model (LLM) embeddings as regression features, mapping textual inputs directly to numerical representations. We empirically establish, for the first time, that LLM embeddings inherently satisfy Lipschitz continuity, a property conducive to stable numerical prediction. Through disentangled analysis of model scale, linguistic competence, and other factors, we demonstrate that these attributes do not necessarily translate to improved regression performance. Contribution/Results: Extensive experiments across multiple real-world high-dimensional regression tasks show that LLM embeddings significantly outperform hand-crafted features. Moreover, we quantify the marginal contributions of individual model factors to predictive performance, providing both theoretical grounding and a practical paradigm for trustworthy deployment of LLM embeddings in regression settings.
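The pipeline described above (serialize inputs as strings, embed them, and regress on the embedding vectors) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `toy_embed` is a hypothetical deterministic stand-in for a real LLM embedding model, and plain ridge regression stands in for whatever downstream regressor is used.

```python
import numpy as np
from hashlib import sha256

def toy_embed(text: str, dim: int = 32) -> np.ndarray:
    """Deterministic stand-in for an LLM embedding: hashes the string
    into a seed and draws a fixed-dimensional vector. NOT a real LLM
    embedding; used only to make the pipeline runnable end to end."""
    h = sha256(text.encode()).digest()
    rng = np.random.default_rng(int.from_bytes(h[:8], "big"))
    return rng.standard_normal(dim)

# String representations of numeric inputs, plus their regression targets.
xs = [f"x1: {i}, x2: {i * 2}" for i in range(20)]
ys = np.array([float(i) for i in range(20)])

# Embed each string, then fit ridge regression (closed form) on the
# embedding matrix: w = (X^T X + lam I)^{-1} X^T y.
X = np.stack([toy_embed(s) for s in xs])
lam = 1e-2
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ ys)
preds = X @ w
```

With a real embedding model, only `toy_embed` would change; the downstream regressor sees nothing but fixed-length feature vectors.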
Abstract
With the rise of large language models (LLMs) for flexibly processing information as strings, a natural application is regression, specifically by preprocessing string representations into LLM embeddings as downstream features for metric prediction. In this paper, we provide one of the first comprehensive investigations into embedding-based regression and demonstrate that LLM embeddings can be better features for high-dimensional regression tasks than traditional feature engineering. This regression performance can be explained in part by LLM embeddings over numeric data inherently preserving Lipschitz continuity over the feature space. Furthermore, we quantify the contribution of different model effects, most notably model size and language understanding, which we find surprisingly do not always improve regression performance.
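The Lipschitz property claimed above can be probed empirically: for an embedding map e, estimate the largest ratio ||e(x) - e(y)|| / |x - y| over pairs of inputs, which lower-bounds the Lipschitz constant. The sketch below assumes a toy smooth embedding (sinusoidal features of a scalar) purely to demonstrate the estimation procedure; it is not the embedding model or the test used in the paper.

```python
import numpy as np

def toy_embed(x: float, dim: int = 16) -> np.ndarray:
    """Smooth stand-in 'embedding' of a scalar input (sinusoidal
    features), used only to illustrate the estimator below."""
    k = np.arange(1, dim + 1)
    return np.concatenate([np.sin(k * x), np.cos(k * x)]) / np.sqrt(dim)

def empirical_lipschitz(points) -> float:
    """Largest pairwise ratio ||e(x) - e(y)|| / |x - y|: an empirical
    lower bound on the Lipschitz constant of the embedding map."""
    ratios = []
    for i, x in enumerate(points):
        for y in points[i + 1:]:
            d_in = abs(x - y)
            if d_in > 0:
                d_out = np.linalg.norm(toy_embed(x) - toy_embed(y))
                ratios.append(d_out / d_in)
    return max(ratios)

# A bounded estimate across the input range is consistent with the
# embedding varying smoothly with its numeric input.
L_hat = empirical_lipschitz(np.linspace(0.0, 1.0, 50))
```

An estimate that stays bounded as the pair set grows denser is the empirical signature of Lipschitz continuity; an estimate that blows up would indicate discontinuous jumps in embedding space.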