Dynamics of Instruction Tuning: Each Ability of Large Language Models Has Its Own Growth Pace

📅 2023-10-30
🏛️ arXiv.org
📈 Citations: 7
Influential: 1
🤖 AI Summary
This work investigates the mechanisms by which instruction tuning enhances general intelligence in Chinese large language models (LLMs), focusing on how data scale, model size (7B–33B), and data construction method (human-authored vs. synthetic) differentially affect multidimensional capabilities, including creative writing, code generation, and logical reasoning. Method: Leveraging an instruction dataset of 40k+ instances annotated across multiple capabilities, we conduct cross-domain ablation studies to isolate these factors. Contribution/Results: We find that underlying capabilities evolve at independent learning paces; that human-authored data remains consistently effective whereas synthetic data exhibits a performance ceiling; and that instruction data demonstrates strong cross-capability generalization. Based on these findings, we propose a quantifiable, efficiency-oriented data construction guideline. Evaluated on two public benchmarks, our approach yields significant performance gains, providing empirical evidence and methodological foundations for capability-targeted LLM optimization.
📝 Abstract
Instruction tuning is a burgeoning method to elicit the general intelligence of Large Language Models (LLMs). However, the creation of instruction data is still largely heuristic, leading to significant variation in quantity and quality across existing datasets. While some research advocates for expanding the number of instructions, others suggest that a small set of well-chosen examples is adequate. To better understand data construction guidelines, our research provides a granular analysis of how data volume, parameter size, and data construction methods influence the development of each underlying ability of an LLM, such as creative writing, code generation, and logical reasoning. We present a meticulously curated dataset with over 40k instances across ten abilities and examine instruction-tuned models with 7B to 33B parameters. Our study reveals three primary findings: (i) Although the models' overall performance is tied to data and parameter scale, individual abilities have different sensitivities to these factors. (ii) Human-curated data strongly outperforms synthetic data from GPT-4 in efficiency and consistently enhances model performance as its volume increases, a property that synthetic data does not exhibit. (iii) Instruction data brings powerful cross-ability generalization, as evidenced by out-of-domain evaluations. Furthermore, we demonstrate how these findings can guide more efficient data construction, leading to practical performance improvements on two public benchmarks.
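The ablation design described above (varying data volume per ability and tracing each ability's learning curve) can be pictured with a short sketch. The snippet below is illustrative only, not the authors' code; the JSONL format, the "ability" field name, and the scale grid are assumptions.

# Illustrative sketch (not the authors' code): build per-ability data-scale
# splits from an ability-annotated instruction dataset so that each ability's
# growth pace can be measured separately. Field names are assumed.
import json
import random

def load_instances(path):
    # One JSON object per line, e.g. {"instruction": ..., "response": ..., "ability": ...}
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def ablation_splits(instances, scales=(0.1, 0.25, 0.5, 1.0), seed=0):
    # Group examples by ability, then subsample every ability at each scale.
    rng = random.Random(seed)
    by_ability = {}
    for ex in instances:
        by_ability.setdefault(ex["ability"], []).append(ex)
    splits = {}
    for scale in scales:
        subset = []
        for pool in by_ability.values():
            shuffled = pool[:]
            rng.shuffle(shuffled)
            subset.extend(shuffled[: max(1, int(len(shuffled) * scale))])
        splits[scale] = subset
    return splits

# Usage: fine-tune one model per split, then evaluate each ability separately
# to observe how sensitive it is to added data.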
Problem

Research questions and friction points this paper is trying to address.

Explores scaling properties of instruction tuning for Chinese LLMs.
Investigates impact of data quantity, model size, and data construction.
Identifies varying sensitivity of abilities to scaling factors.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic study of instruction tuning for Chinese LLMs
Utilization of DoIT dataset with 40,000 instruction instances
Tailored training strategies for varying ability sensitivities
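One way to act on the "tailored training strategies" point is to reweight the per-ability data mix once each ability's sensitivity is known. The sketch below is a hypothetical illustration, not a method taken from the paper; the weights and the sampling-with-replacement choice are assumptions.

# Hypothetical sketch: upsample abilities that still benefit from more data,
# downsample abilities that saturate early, when assembling a tuning mix.
import random

def build_mix(pools, weights, budget, seed=0):
    # pools: {ability: [examples]}; weights: {ability: relative weight};
    # budget: total number of training examples to draw.
    rng = random.Random(seed)
    total = sum(weights.values())
    mix = []
    for ability, pool in pools.items():
        quota = int(budget * weights.get(ability, 0) / total)
        if pool and quota:
            # Sampling with replacement allows upsampling small but sensitive abilities.
            mix.extend(rng.choices(pool, k=quota))
    rng.shuffle(mix)
    return mix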
Chiyu Song
Zhejiang University; Westlake University
natural language processing, large language models, text generation, chatbot
Zhanchao Zhou
Ph.D. student, Westlake University & Zhejiang University
Natural Language Processing
Jianhao Yan
Zhejiang University, School of Engineering, Westlake University
Yuejiao Fei
Zhejiang University, School of Engineering, Westlake University
Zhenzhong Lan
School of Engineering, Westlake University
NLP, Computer Vision, Multimedia
Yue Zhang
Institute of Advanced Technology, Westlake Institute for Advanced Study