Instruct-SkillMix: A Powerful Pipeline for LLM Instruction Tuning

📅 2024-08-27
🏛️ arXiv.org
📈 Citations: 6
Influential: 1
🤖 AI Summary
To address the high cost, labor intensity, and limited diversity of human-curated supervised fine-tuning (SFT) data, this paper proposes Instruct-SkillMix, a fully automated pipeline that uses LLM-based metacognitive prompting to extract the core skills underlying instruction-following, then randomly recombines pairs of those skills to generate diverse, high-quality SFT samples. The work introduces a skill-extraction-and-random-recombination paradigm for SFT data generation, and its ablations show SFT's acute sensitivity to low-quality samples, suggesting why crowd-sourced SFT datasets often underperform. Fine-tuning LLaMA-3-8B-Base on just 4K synthetically generated samples achieves a 42.76% length-controlled win rate on AlpacaEval 2.0, competitive with frontier models such as Claude 3 Opus, at a total dataset-creation cost under $600.

📝 Abstract
We introduce Instruct-SkillMix, an automated approach for creating diverse, high-quality SFT data for instruction-following. The pipeline involves two stages, each leveraging an existing powerful LLM: (1) Skill extraction: uses the LLM to extract core "skills" for instruction-following by directly prompting the model. This is inspired by the "LLM metacognition" of Didolkar et al. (2024); (2) Data generation: uses the powerful LLM to generate (instruction, response) data that exhibit a randomly chosen pair of these skills. Here, the use of random skill combinations promotes diversity and difficulty. The estimated cost of creating the dataset is under $600. Vanilla SFT (i.e., no PPO, DPO, or RL methods) on data generated from Instruct-SkillMix leads to strong gains on instruction-following benchmarks such as AlpacaEval 2.0, MT-Bench, and WildBench. With just 4K examples, LLaMA-3-8B-Base achieves a 42.76% length-controlled win rate on AlpacaEval 2.0, a level similar to frontier models like Claude 3 Opus and LLaMA-3.1-405B-Instruct. Ablation studies also suggest plausible reasons why creating open instruction-tuning datasets via naive crowd-sourcing has proved difficult: in our dataset, adding 20% low-quality answers ("shirkers") causes a noticeable degradation in performance. The Instruct-SkillMix pipeline seems flexible and adaptable to other settings.
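The two-stage pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the prompt wording, the `llm` callable, and the output format are assumptions.

```python
import random

def extract_skills(llm, seed_topic, n_skills=50):
    # Stage 1 (skill extraction): prompt a strong LLM to enumerate core
    # instruction-following skills, one per line.
    prompt = (
        f"List {n_skills} core skills an assistant needs to follow "
        f"instructions about {seed_topic}. One skill per line."
    )
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]

def generate_sft_pair(llm, skills):
    # Stage 2 (data generation): sample a random pair of skills and ask the
    # LLM for an (instruction, response) example exercising both. The random
    # combination is what drives diversity and difficulty.
    a, b = random.sample(skills, 2)
    prompt = (
        f"Write a challenging user instruction that requires both "
        f"'{a}' and '{b}', followed by a high-quality response.\n"
        "Format:\nINSTRUCTION: ...\nRESPONSE: ..."
    )
    text = llm(prompt)
    instruction, _, response = text.partition("RESPONSE:")
    return instruction.replace("INSTRUCTION:", "").strip(), response.strip()
```

In practice `llm` would wrap an API call to a powerful model; repeating `generate_sft_pair` a few thousand times yields an SFT dataset in the spirit of the paper's 4K-example setup.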
Problem

Research questions and friction points this paper is trying to address.

Automates creation of diverse SFT data for instruction-following
Improves LLM performance on benchmarks with minimal data
Addresses challenges in naive crowd-sourced instruction-tuning datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated diverse SFT data generation
LLM metacognition for skill extraction
Random skill combinations enhance diversity
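One way to see why random skill pairing promotes diversity: with k extracted skills there are k(k-1)/2 distinct unordered pairs, so even a modest skill list yields far more prompt combinations than any single-skill scheme. The skill count below is illustrative, not a figure from the paper.

```python
from math import comb

def num_skill_pairs(k: int) -> int:
    # Distinct unordered pairs drawn from k extracted skills: C(k, 2).
    return comb(k, 2)

print(num_skill_pairs(100))  # 100 skills -> 4950 distinct pairs
```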