🤖 AI Summary
Existing table-oriented instruction-tuning studies suffer from inconsistent training data, hindering rigorous isolation of architectural versus data-quality effects on model performance.
Method: This work presents the first systematic disentanglement of these two factors, applying a unified instruction-tuning pipeline and standardized evaluation protocol across Mistral, OLMo, and Phi series models, enabling fair cross-model and cross-dataset comparisons. The methodology combines instruction tuning, multi-benchmark reproduction (including HiTab for table question answering), cross-domain generalization testing, and joint evaluation on both table-specific and general-purpose NLP benchmarks.
Contributions/Results: We empirically uncover an inherent trade-off between table specialization and general language capability; achieve new state-of-the-art performance on HiTab; match or exceed prior table LLMs in reproduced evaluations; and, crucially, provide the first quantitative decomposition of performance gains attributable separately to model architecture and training data quality.
📝 Abstract
Recent advances in natural language processing have leveraged instruction tuning to enhance Large Language Models (LLMs) for table-related tasks. However, previous works train different base models with different training data, lacking an apples-to-apples comparison across the resulting table LLMs. To address this, we fine-tune base models from the Mistral, OLMo, and Phi families on existing public training datasets. Our replication achieves performance on par with or surpassing existing table LLMs, establishing new state-of-the-art performance on HiTab, a table question-answering dataset. More importantly, through systematic out-of-domain evaluation, we decouple the contributions of training data and the base model, providing insight into their individual impacts. In addition, we assess the effects of table-specific instruction tuning on general-purpose benchmarks, revealing trade-offs between specialization and generalization.