Text embedding models can be great data engineers

📅 2025-05-20
🤖 AI Summary
Contemporary data engineering pipelines rely heavily on manual feature engineering, which is labor-intensive and lacks generalizability. To address this, we propose ADEPT, a novel framework that introduces text embedding entropy as a principled information-theoretic measure for time series. ADEPT jointly optimizes textual time-series representation, pre-trained text embedding models, and the variational information bottleneck (VIB) to enable end-to-end automated time-series data engineering. It eliminates handcrafted feature design by directly extracting high-entropy representations from raw textualized time series while suppressing embedding variance to enhance robustness. Evaluated across diverse benchmarks in healthcare, finance, scientific computing, and industrial IoT, ADEPT consistently outperforms state-of-the-art methods. It demonstrates exceptional robustness to missing values, formatting errors, and irregular timestamps, significantly improving both data science efficiency and scalability.

πŸ“ Abstract
Data engineering pipelines are essential, albeit costly, components of predictive analytics frameworks, requiring significant engineering time and domain expertise for tasks such as data ingestion, preprocessing, feature extraction, and feature engineering. In this paper, we propose ADEPT, an automated data engineering pipeline via text embeddings. At the core of the ADEPT framework is a simple yet powerful idea: the entropy of embeddings corresponding to a textually dense, raw-format representation of time series can be intuitively viewed as equivalent (or in many cases superior) to that of the numerically dense vector representations obtained by data engineering pipelines. Consequently, ADEPT uses a two-step approach that (i) leverages text embeddings to represent the diverse data sources, and (ii) constructs a variational information bottleneck criterion to mitigate entropy variance in text embeddings of time series data. ADEPT provides an end-to-end automated implementation of predictive models that offers superior predictive performance despite issues such as missing data, ill-formed records, improper or corrupted data formats, and irregular timestamps. Through exhaustive experiments, we show that ADEPT outperforms the best existing benchmarks on a diverse set of datasets from large-scale applications across healthcare, finance, science, and the industrial internet of things. Our results show that ADEPT can potentially leapfrog many conventional data pipeline steps, thereby paving the way for efficient and scalable automation pathways for diverse data science applications.
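
Step (i) of the abstract starts from a raw textual form of the time series. The abstract does not say how that text is laid out, so the following is only a minimal sketch under assumptions: the `textualize` helper, the `Reading` type, and the "timestamp, value" line format are hypothetical and are not taken from the paper. The point it illustrates is that missing values and inconsistent timestamps are written out unchanged rather than cleaned, so a downstream text embedding model would see the data in its raw state.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical record type: (timestamp as found in the source, value or None if missing).
Reading = Tuple[str, Optional[float]]


def textualize(readings: Sequence[Reading], name: str = "sensor") -> str:
    """Serialize raw readings into one text string without cleaning them.

    Missing values are written as "NaN" and timestamps are copied verbatim,
    so gaps and inconsistent formats stay visible in the text.
    """
    lines = [f"series: {name}"]
    for timestamp, value in readings:
        shown = "NaN" if value is None else repr(value)
        lines.append(f"{timestamp}, {shown}")
    return "\n".join(lines)


# Irregular spacing, one missing value, and one differently formatted timestamp.
raw = [
    ("2025-05-20 00:00", 1.5),
    ("2025-05-20 00:07", None),
    ("20/05/2025 00:31", 2.0),
]
text = textualize(raw)
print(text)
```

The resulting string would then be passed to whichever pre-trained text embedding model is in use; that call is left out here because the abstract does not name a model.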
Problem

Research questions and friction points this paper is trying to address.

Automating costly data engineering pipelines for predictive analytics
Reducing entropy variance in text embeddings of time series data
Handling missing or corrupted data in diverse applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated data engineering using text embeddings
Variational information bottleneck for entropy variance
End-to-end predictive model with text embeddings
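
The variational information bottleneck named in the list above is not spelled out on this page. As a rough illustration only, the sketch below computes the commonly used deep VIB loss for a single example: a prediction term (negative log-likelihood of the true label) plus a weighted KL divergence between a diagonal Gaussian code and a standard normal prior. The Gaussian encoder, the standard normal prior, the `beta` weight, and the function names are all assumptions; the criterion actually used by ADEPT may differ.

```python
import math
from typing import Sequence


def gaussian_kl(mu: Sequence[float], log_var: Sequence[float]) -> float:
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def vib_loss(
    class_probs: Sequence[float],
    label: int,
    mu: Sequence[float],
    log_var: Sequence[float],
    beta: float = 1e-3,
) -> float:
    """Per-example VIB loss: prediction term plus beta times the compression term."""
    nll = -math.log(class_probs[label])  # how well the code predicts the label
    return nll + beta * gaussian_kl(mu, log_var)  # penalize information kept about the input


# Toy values: a two-dimensional code and a three-class prediction.
loss = vib_loss(
    class_probs=[0.7, 0.2, 0.1], label=0, mu=[0.5, -0.5], log_var=[0.0, 0.0]
)
print(loss)
```

In this form, a larger `beta` pushes the code toward the prior and discards more of the input, while a smaller `beta` favors predictive accuracy. In practice `mu` and `log_var` would come from a small network applied to the text embedding.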