🤖 AI Summary
This work addresses the lack of a systematic benchmark for tabular learning on real-world data containing strings, which has made it unclear whether specialized architectures are necessary or if effective performance can be achieved through encoding alone. To bridge this gap, the authors introduce STRABLE, the first standardized benchmark for string-aware tabular learning, comprising 108 real-world mixed-type tables. They conduct a large-scale evaluation of 445 end-to-end and modular pipelines that combine string embeddings, categorical encodings, post-processing strategies, and diverse tabular models. Results show that on category-dominated tables, simple embeddings paired with advanced tabular models are most efficient, whereas in free-text-heavy settings, encodings derived from large language models yield superior performance. STRABLE produces generalizable pipeline rankings, demonstrating its validity as a research benchmark and offering practitioners effective, low-cost solutions.
📝 Abstract
Benchmarking tabular learning has revealed the benefit of dedicated architectures, pushing the state of the art. But real-world tables often contain string entries, beyond numbers, and these settings have been understudied due to a lack of a solid benchmarking suite. They lead to new research questions: Are dedicated learners needed, with end-to-end modeling of strings and numbers? Or does it suffice to encode strings as numbers, as with a categorical encoding? And if so, do the resulting tables resemble numerical tabular data, calling for the same learners? To enable these studies, we contribute STRABLE, a benchmarking corpus of 108 tables, all real-world learning problems with strings and numbers across diverse application fields. We run the first large-scale empirical study of tabular learning with strings, evaluating 445 pipelines. These pipelines span end-to-end architectures and modular pipelines, where strings are first encoded, then post-processed, and finally passed to a tabular learner. We find that, because most tables in the wild are categorical-dominant, advanced tabular learners paired with simple string embeddings achieve good predictions at low computational cost. On free-text-dominant tables, large LLM encoders become competitive. Their performance also appears sensitive to post-processing, with differences across LLM families. Finally, we show that STRABLE is a good set of tables to study "string tabular" learning as it leads to generalizable pipeline rankings that are close to the oracle rankings. We thus establish STRABLE as a foundation for research on tabular learning with strings, an important yet understudied area.