AI Summary
Traditional feature engineering for high-dimensional numerical regression relies heavily on domain expertise and suffers from poor generalizability. Method: This paper systematically investigates the effectiveness and underlying mechanisms of large language model (LLM) embeddings as regression features, mapping textual inputs directly to numerical representations. We empirically establish, for the first time, that LLM embeddings inherently satisfy Lipschitz continuity, a property conducive to stable numerical prediction. Through disentangled analysis of model scale, linguistic competence, and other factors, we demonstrate that these attributes do not necessarily translate to improved regression performance. Contribution/Results: Extensive experiments across multiple real-world high-dimensional regression tasks show that LLM embeddings significantly outperform hand-crafted features. Moreover, we quantify the marginal contributions of individual model factors to predictive performance, providing both theoretical grounding and a practical paradigm for trustworthy deployment of LLM embeddings in regression settings.
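The pipeline described above (serialize inputs as strings, embed them, and regress on the embedding vectors) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `toy_embed` is a hypothetical deterministic stand-in for a real LLM embedding model, and plain ridge regression stands in for whatever downstream regressor is used.

```python
import numpy as np
from hashlib import sha256

def toy_embed(text: str, dim: int = 32) -> np.ndarray:
    """Deterministic stand-in for an LLM embedding: hashes the string
    into a seed and draws a fixed-dimensional vector. NOT a real LLM
    embedding; used only to make the pipeline runnable end to end."""
    h = sha256(text.encode()).digest()
    rng = np.random.default_rng(int.from_bytes(h[:8], "big"))
    return rng.standard_normal(dim)

# String representations of numeric inputs, plus their regression targets.
xs = [f"x1: {i}, x2: {i * 2}" for i in range(20)]
ys = np.array([float(i) for i in range(20)])

# Embed each string, then fit ridge regression (closed form) on the
# embedding matrix: w = (X^T X + lam I)^{-1} X^T y.
X = np.stack([toy_embed(s) for s in xs])
lam = 1e-2
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ ys)
preds = X @ w
```

With a real embedding model, only `toy_embed` would change; the downstream regressor sees nothing but fixed-length feature vectors.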
Abstract
With the rise of large language models (LLMs) for flexibly processing information as strings, a natural application is regression, specifically by preprocessing string representations into LLM embeddings as downstream features for metric prediction. In this paper, we provide one of the first comprehensive investigations into embedding-based regression and demonstrate that LLM embeddings can be better features for high-dimensional regression tasks than traditional feature engineering. This regression performance can be explained in part by LLM embeddings over numeric data inherently preserving Lipschitz continuity over the feature space. Furthermore, we quantify the contribution of different model effects, most notably model size and language understanding, which we find surprisingly do not always improve regression performance.
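The Lipschitz property claimed above can be probed empirically: for an embedding map e, estimate the largest ratio ||e(x) - e(y)|| / |x - y| over pairs of inputs, which lower-bounds the Lipschitz constant. The sketch below assumes a toy smooth embedding (sinusoidal features of a scalar) purely to demonstrate the estimation procedure; it is not the embedding model or the test used in the paper.

```python
import numpy as np

def toy_embed(x: float, dim: int = 16) -> np.ndarray:
    """Smooth stand-in 'embedding' of a scalar input (sinusoidal
    features), used only to illustrate the estimator below."""
    k = np.arange(1, dim + 1)
    return np.concatenate([np.sin(k * x), np.cos(k * x)]) / np.sqrt(dim)

def empirical_lipschitz(points) -> float:
    """Largest pairwise ratio ||e(x) - e(y)|| / |x - y|: an empirical
    lower bound on the Lipschitz constant of the embedding map."""
    ratios = []
    for i, x in enumerate(points):
        for y in points[i + 1:]:
            d_in = abs(x - y)
            if d_in > 0:
                d_out = np.linalg.norm(toy_embed(x) - toy_embed(y))
                ratios.append(d_out / d_in)
    return max(ratios)

# A bounded estimate across the input range is consistent with the
# embedding varying smoothly with its numeric input.
L_hat = empirical_lipschitz(np.linspace(0.0, 1.0, 50))
```

An estimate that stays bounded as the pair set grows denser is the empirical signature of Lipschitz continuity; an estimate that blows up would indicate discontinuous jumps in embedding space.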