🤖 AI Summary
This work identifies pervasive benchmark overfitting in the NeurIPS 2023 LLM fine-tuning competition: top-performing methods suffer a 32% performance drop on closed-source test sets, revealing fragile generalization under current public-benchmark-driven evaluation. The competition's two-stage design (open evaluation on public tasks followed by closed evaluation on unseen tasks) made it possible to measure this generalization gap directly and confirms benchmark overfitting as the dominant bottleneck. Notably, the winning entries relied on standard open-source tooling and centered on data curation rather than architectural or algorithmic novelty, underscoring data quality as the primary lever for LLM generalization. All competition submissions, fine-tuning pipelines, and Dockerized evaluation infrastructure are open-sourced, establishing both a methodological foundation and an empirical benchmark for rethinking evaluation paradigms in generative modeling.
📝 Abstract
Our analysis of the NeurIPS 2023 large language model (LLM) fine-tuning competition revealed two trends: top-performing models overfit significantly to benchmark datasets, mirroring the broader problem of benchmark overfitting on popular leaderboards, and data curation is essential for producing a high-performing LLM. The competition consisted of two stages, an open evaluation stage with publicly available tasks and a closed evaluation stage with unseen tasks, which allowed us to assess the generalizability of fine-tuned LLMs. Our results highlight the limitations of current benchmark-based evaluation schemes for generative models and demonstrate the need for more robust evaluation methods. Notably, the winning submissions used standard open-source libraries and focused primarily on data curation. To facilitate further research and promote reproducibility, we release all competition entries, Docker files, and evaluation infrastructure, providing a valuable resource for the community to study fine-tuning, overfitting, and reproducibility in LLMs.
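The two-stage protocol described above can be sketched minimally: score a model on the open-stage (public) tasks and on the closed-stage (unseen) tasks, then report the relative drop as a simple generalization-gap metric. This is a hypothetical illustration, not the competition's actual scoring code; the task names and scores below are invented for demonstration.

```python
# Hypothetical sketch of the two-stage evaluation described in the abstract.
# Task names and scores are illustrative, not taken from the competition.

def relative_drop(open_scores, closed_scores):
    """Return (open mean, closed mean, relative drop from open to closed)."""
    open_mean = sum(open_scores.values()) / len(open_scores)
    closed_mean = sum(closed_scores.values()) / len(closed_scores)
    return open_mean, closed_mean, (open_mean - closed_mean) / open_mean

open_stage = {"task_a": 0.80, "task_b": 0.70}    # publicly available tasks
closed_stage = {"task_c": 0.55, "task_d": 0.47}  # unseen, closed-stage tasks

open_mean, closed_mean, drop = relative_drop(open_stage, closed_stage)
print(f"open={open_mean:.2f} closed={closed_mean:.2f} drop={drop:.0%}")
```

A large relative drop (for example the 32% figure cited for top submissions) signals that a model's leaderboard performance on public tasks overstates how well it generalizes to unseen ones.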