AI Summary
This work addresses the challenge of extracting structured tables from unstructured clipboard text, which is hindered by data scarcity and highly variable input formats. The authors propose a "last-mile" fine-tuning paradigm (Lift): a pretrained large language model first generates an initial table draft, and a smaller fine-tuned language model (1B–24B parameters) then repairs errors in that draft. On a benchmark of 2,596 tables from three datasets, this cascaded pipeline matches or exceeds end-to-end fine-tuning as measured by Tree-Edit-Distance-based Similarity (TEDS), and with as few as 1,000 training examples it outperforms end-to-end fine-tuning by up to 0.144 TEDS points. The method is also more robust to input format variation, which makes it attractive when training data is limited.
Abstract
We propose last-mile fine-tuning, or Lift, a pipeline in which a pre-trained large language model extracts an initial table from unstructured clipboard text, and a fine-tuned small language model (SLM, 1B–24B parameters) repairs errors in the extracted table. On a benchmark of 2,596 tables from three datasets, Lift matches or exceeds end-to-end SLM fine-tuning on the tree-edit-distance-based similarity (TEDS) metric while requiring as few as 1,000 training examples, a setting in which it outperforms end-to-end fine-tuning by up to 0.144 TEDS points. We also show that it is more robust to input format variability. Comparisons with self-debug and end-to-end fine-tuning approaches show that last-mile fine-tuning provides an attractive option when training data is limited or when robustness to input variation is sought without compromising on accuracy.
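The two-stage structure described in the abstract can be sketched as a simple composition of an extraction step and a repair step. This is a minimal illustration, not the authors' implementation: the names `lift_pipeline`, `extract_with_llm`, and `repair_with_slm` are hypothetical, the toy stages stand in for real model calls, and passing the original text to the repair stage alongside the draft is an assumption the abstract does not confirm.

```python
from typing import Callable, List

Table = List[List[str]]  # a table as rows of cell strings


def lift_pipeline(
    clipboard_text: str,
    extract_with_llm: Callable[[str], Table],
    repair_with_slm: Callable[[str, Table], Table],
) -> Table:
    """Two-stage cascade: a pre-trained LLM drafts a table, then a
    fine-tuned SLM repairs errors in that draft.

    Both callables are hypothetical placeholders for model calls.
    """
    draft = extract_with_llm(clipboard_text)
    # Assumption: the repair stage sees the source text and the draft.
    return repair_with_slm(clipboard_text, draft)


# Toy stand-ins so the sketch runs without any model.
def toy_extract(text: str) -> Table:
    # Naive comma split; imitates an imperfect first-pass draft.
    return [line.split(",") for line in text.splitlines() if line.strip()]


def toy_repair(text: str, draft: Table) -> Table:
    # Strip stray whitespace and pad short rows to a uniform width.
    width = max((len(row) for row in draft), default=0)
    return [
        [cell.strip() for cell in row] + [""] * (width - len(row))
        for row in draft
    ]


if __name__ == "__main__":
    print(lift_pipeline("name, qty\napple, 3\npear", toy_extract, toy_repair))
```

Keeping the two stages as separate callables reflects the idea that only the repair model needs fine-tuning, while the extraction model can remain an off-the-shelf pre-trained LLM.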