Scalable In-Context Learning on Tabular Data via Retrieval-Augmented Large Language Models

📅 2025-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the sequence-length bottleneck that arises when tabular data is textualized for few-shot learning with large language models (LLMs), this paper proposes TabICL, a scalable retrieval-augmented in-context learning framework for tables. Methodologically, it introduces a table-structure-aware retrieval module, combined with retrieval-guided instruction tuning, compact textual table representations, and a multi-task training paradigm, thereby working around the native context-window limits of LLMs. Contributions include: (i) the first deep integration of retrieval into the TabICL pipeline, advancing "language as a universal interface for tabular learning" as a paradigm; (ii) consistent performance gains across 69 mainstream tabular datasets, enabling inference on tables of any size; and (iii) strong scalability and compatibility with diverse LLM backbones, with wins over well-tuned numeric models on specific datasets even though LLM-based TabICL still trails them in overall performance.
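As a rough illustration of why compact textual table representations matter, the minimal sketch below serializes a row as terse "column=value" pairs instead of sentence-style templates; the `serialize_row` helper and its exact format are assumptions for this sketch, not the paper's actual representation.

```python
# Hedged sketch: serialize_row and the "col=value" format are illustrative
# assumptions; the paper's actual compact representation is not shown here.
def serialize_row(row: dict, label=None) -> str:
    # Terse "column=value" pairs consume far fewer tokens than full
    # sentences, letting more in-context rows fit in the LLM's window.
    body = "|".join(f"{k}={v}" for k, v in row.items())
    return f"{body} -> {label}" if label is not None else body

print(serialize_row({"age": 39, "education": "Bachelors", "hours_per_week": 40},
                    label=">50K"))
# age=39|education=Bachelors|hours_per_week=40 -> >50K
```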

📝 Abstract
Recent studies have shown that large language models (LLMs), when customized with post-training on tabular data, can acquire general tabular in-context learning (TabICL) capabilities. These models are able to transfer effectively across diverse data schemas and different task domains. However, existing LLM-based TabICL approaches are constrained to few-shot scenarios due to the sequence length limitations of LLMs, as tabular instances represented in plain text consume substantial tokens. To address this limitation and enable scalable TabICL for any data size, we propose retrieval-augmented LLMs tailored to tabular data. Our approach incorporates a customized retrieval module, combined with retrieval-guided instruction-tuning for LLMs. This enables LLMs to effectively leverage larger datasets, achieving significantly improved performance across 69 widely recognized datasets and demonstrating promising scaling behavior. Extensive comparisons with state-of-the-art tabular models reveal that, while LLM-based TabICL still lags behind well-tuned numeric models in overall performance, it uncovers powerful algorithms under limited contexts, enhances ensemble diversity, and excels on specific datasets. These unique properties underscore the potential of language as a universal and accessible interface for scalable tabular data learning.
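To make the abstract's retrieval idea concrete, here is a minimal sketch of how a retrieval module in front of an LLM could keep prompts short regardless of table size. Everything here is an assumption for illustration: `build_prompt`, the cosine-similarity retriever, and the embedding inputs stand in for the paper's customized, table-structure-aware module.

```python
import numpy as np

def build_prompt(query_text, query_emb, corpus_texts, corpus_labels,
                 corpus_embs, k=8):
    """Hypothetical retriever: pick the k training rows most similar to
    the query (cosine similarity over row embeddings) and serialize them
    as in-context demonstrations; prompt length then depends on k, not
    on the size of the underlying table."""
    sims = corpus_embs @ query_emb / (
        np.linalg.norm(corpus_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9)
    top = np.argsort(-sims)[:k]
    demos = "\n".join(f"{corpus_texts[i]} -> {corpus_labels[i]}" for i in top)
    return f"{demos}\n{query_text} ->"

# Toy demo with three embedded rows and k=2:
texts = ["age=25|hours=40", "age=47|hours=60", "age=31|hours=20"]
labels = ["<=50K", ">50K", "<=50K"]
embs = np.array([[0.2, 0.9], [0.9, 0.3], [0.1, 0.8]])
print(build_prompt("age=28|hours=38", np.array([0.15, 0.85]),
                   texts, labels, embs, k=2))
```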
Problem

Research questions and friction points this paper is trying to address.

Enable scalable in-context learning on tabular data
Overcome sequence length limitations in large language models
Enhance performance across diverse datasets with retrieval-augmented LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-augmented LLMs for tabular data
Customized retrieval module integration
Retrieval-guided instruction-tuning for scalability (see the sketch after this list)
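The sketch below shows one plausible way retrieval-guided instruction-tuning data could be constructed: each training row becomes a supervised pair whose in-context demonstrations are its own retrieved neighbors, so the tuned LLM learns to exploit retrieved context at inference time. `make_tuning_pairs` and the toy retriever are illustrative assumptions, not the paper's recipe.

```python
import json

def make_tuning_pairs(rows, labels, retrieve, k=2):
    """Build (prompt, completion) pairs where each row's demonstrations
    come from its retrieved neighbors (retrieval itself is pluggable)."""
    pairs = []
    for i, row in enumerate(rows):
        neighbors = retrieve(i, k)  # indices of similar rows, excluding i
        demos = "\n".join(f"{rows[j]} -> {labels[j]}" for j in neighbors)
        pairs.append({"prompt": f"{demos}\n{row} ->",
                      "completion": f" {labels[i]}"})
    return pairs

# Toy demo with a trivial "retriever" that returns other row indices:
rows = ["age=25|hours=40", "age=47|hours=60", "age=31|hours=20"]
labels = ["<=50K", ">50K", "<=50K"]
retrieve = lambda i, k: [j for j in range(len(rows)) if j != i][:k]
for p in make_tuning_pairs(rows, labels, retrieve):
    print(json.dumps(p))  # one JSONL line per supervised tuning example
```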
👥 Authors
Xumeng Wen, Microsoft Research Asia
Shun Zheng, Microsoft Research Asia
Zhen Xu, The University of Chicago, Chicago, IL, USA
Yiming Sun, University of Pittsburgh, Pittsburgh, PA, USA
Jiang Bian, Microsoft Research Asia, Beijing, China