Retrieve, Merge, Predict: Augmenting Tables with Data Lakes

📅 2024-02-09

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

This paper studies automated table augmentation for machine learning over data lakes, where features must be assembled by retrieving, merging, and aggregating tables. It presents an in-depth analysis of different methods for the three main steps of the pipeline: retrieving joinable tables, merging their information, and predicting with the resulting table. Experiments use two data lakes: Open Data US, a well-referenced real data lake, and YADL (Yet Another Data Lake), a novel semi-synthetic data lake developed as a benchmarking tool for this data discovery task. The key findings are: (1) accurately retrieving join candidates is important; (2) simple merging methods are efficient; and (3) tree-based learners are resilient to noisy conditions. The experimental environment is reproducible and based on open data, with the aim of fostering further research on feature engineering, autoML, and learning in data lakes.

📝 Abstract

Machine-learning from a disparate set of tables, a data lake, requires assembling features by merging and aggregating tables. Data discovery can extend autoML to data tables by automating these steps. We present an in-depth analysis of such automated table augmentation for machine learning tasks, analyzing different methods for the three main steps: retrieving joinable tables, merging information, and predicting with the resultant table. We use two data lakes: Open Data US, a well-referenced real data lake, and a novel semi-synthetic dataset, YADL (Yet Another Data Lake), which we developed as a tool for benchmarking this data discovery task. Systematic exploration on both lakes outlines 1) the importance of accurately retrieving join candidates, 2) the efficiency of simple merging methods, and 3) the resilience of tree-based learners to noisy conditions. Our experimental environment is easily reproducible and based on open data, to foster more research on feature engineering, autoML, and learning in data lakes.
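To make the first two steps named in the abstract concrete, here is a minimal sketch of retrieving join candidates and merging them into a base table. It is an illustration under stated assumptions, not the paper's implementation: the containment score, the 0.5 threshold, the mean aggregation, the function names, and the toy tables are all hypothetical choices made for this example. The prediction step is left out; the augmented rows would be passed to a learner of your choice.

```python
from collections import defaultdict
from statistics import mean


def containment(query_values, candidate_values):
    """Fraction of distinct query key values that also appear in the candidate column."""
    query_set = set(query_values)
    if not query_set:
        return 0.0
    return len(query_set & set(candidate_values)) / len(query_set)


def retrieve_candidates(base_rows, key, lake, threshold=0.5):
    """Return (score, table_name, column) triples whose containment meets the threshold.

    The threshold is an assumption for this sketch, not a value from the paper.
    """
    base_keys = [row[key] for row in base_rows]
    scored = []
    for name, rows in lake.items():
        if not rows:
            continue
        for column in rows[0]:
            score = containment(base_keys, [r.get(column) for r in rows])
            if score >= threshold:
                scored.append((score, name, column))
    # Best score first; ties broken by table and column name for determinism.
    scored.sort(key=lambda item: (-item[0], item[1], item[2]))
    return scored


def aggregate_and_left_join(base_rows, key, cand_rows, cand_key, prefix):
    """Left-join a candidate table, averaging numeric columns over duplicate keys."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in cand_rows:
        for column, value in row.items():
            if column != cand_key and isinstance(value, (int, float)):
                grouped[row[cand_key]][column].append(value)
    feature_names = sorted({c for cols in grouped.values() for c in cols})
    merged = []
    for row in base_rows:
        out = dict(row)
        cols = grouped.get(row[key], {})
        for column in feature_names:
            values = cols.get(column)
            # Unmatched keys get None, as in a left join with missing values.
            out[f"{prefix}.{column}"] = mean(values) if values else None
        merged.append(out)
    return merged


# Toy data, invented for illustration only.
base = [
    {"city": "Austin", "price": 310},
    {"city": "Boston", "price": 540},
    {"city": "Denver", "price": 420},
]
lake = {
    "census": [
        {"place": "Austin", "population": 960},
        {"place": "Boston", "population": 650},
        {"place": "Boston", "population": 670},
    ],
    "movies": [{"title": "Heat", "year": 1995}],
}

candidates = retrieve_candidates(base, "city", lake)
augmented = base
for _, name, column in candidates:
    augmented = aggregate_and_left_join(augmented, "city", lake[name], column, name)
```

In this toy lake only `census.place` overlaps with the base key, so it is the single candidate retrieved. The duplicate Boston rows are averaged before the join so the base table keeps one row per city, and Denver, which has no match, receives a missing value.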

Problem

Research questions and friction points this paper is trying to address.

Data Integration

Predictive Analytics

Large Databases

Innovation

Methods, ideas, or system contributions that make the work stand out.

Data Discovery Methods

Tabular Data Integration

Automated Machine Learning

🔎 Similar Papers

TabSketchFM: Sketch-Based Tabular Representation Learning for Data Discovery Over Data Lakes