AI Summary
To address the high cost and low efficiency of acquiring high-quality domain-specific instruction data for instruction tuning, this paper proposes a self-guided instruction generation framework. The framework introduces three key innovations: (1) a diversity-aware batch filtering mechanism, novel in its reduction of redundant API calls; (2) dynamic coupling of instruction generation and model training, enabling a closed-loop optimization process driven by real-time training feedback; and (3) joint instruction-training optimization combined with an LLM-based self-generation and self-evaluation coordination mechanism. Experimental results demonstrate that, compared to conventional methods, the approach maintains instruction diversity and task coverage while improving downstream task accuracy by 5.2% and reducing instruction data generation cost by 36%.
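The diversity-aware batch filtering idea can be illustrated with a minimal sketch. This is not the paper's implementation: the similarity metric (`difflib` word-overlap) and the threshold `max_sim=0.7` are hypothetical stand-ins for whatever measure the framework actually uses; the point is only the greedy rule of admitting an instruction into a batch when it is sufficiently dissimilar from those already selected, instead of discarding it outright.

```python
import difflib

def similarity(a: str, b: str) -> float:
    """Rough lexical similarity between two instructions, in [0, 1]."""
    return difflib.SequenceMatcher(
        None, a.lower().split(), b.lower().split()
    ).ratio()

def diversity_filter(candidates, batch_size, max_sim=0.7):
    """Greedily build a batch: accept a candidate only if it is
    sufficiently dissimilar from every instruction already kept."""
    batch = []
    for inst in candidates:
        if all(similarity(inst, kept) < max_sim for kept in batch):
            batch.append(inst)
        if len(batch) == batch_size:
            break
    return batch
```

Near-duplicates (e.g. two phrasings of the same translation request) are rejected, while unrelated instructions pass through, so each batch stays diverse without an aggressive quality gate that would waste the API calls already spent.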
Abstract
The rapid evolution of Large Language Models (LLMs) has enabled the industry to develop various AI-based services. Instruction tuning is considered essential for adapting foundation models to target domains so as to provide high-quality services to customers. A key challenge in instruction tuning is obtaining high-quality instruction data. Self-Instruct, which automatically generates instruction data using ChatGPT APIs, alleviates the data scarcity problem. To improve the quality of instruction data, Self-Instruct discards many of the instructions generated by ChatGPT, which is cost-inefficient owing to the many wasted API calls. To generate high-quality instruction data at low cost, we propose a novel data generation framework, Self-Direct Instruction generation (SeDi-Instruct), which employs diversity-based filtering and iterative feedback task generation. Diversity-based filtering maintains model accuracy without excessively discarding low-quality generated instructions by enhancing the diversity of instructions within a batch, which reduces the cost of synthesizing instruction data. Iterative feedback task generation integrates the instruction generation and training tasks and exploits information obtained during training to create high-quality instruction sets. Our results show that SeDi-Instruct improves the accuracy of AI models by 5.2% compared with traditional methods, while reducing data generation costs by 36%.
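The iterative feedback loop described above can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's algorithm: `generate_instructions` and `train_step` are hypothetical placeholders for the real LLM-based generator and the actual training run, and "feed the hardest (highest-loss) instructions back as next-round seeds" is one plausible form of the training-feedback signal.

```python
import random

def generate_instructions(seed_pool, n, rng):
    """Placeholder for LLM-based generation: derive n new instructions
    from the seed pool (a real system would call an LLM API here)."""
    return [f"{rng.choice(seed_pool)} (variant {i})" for i in range(n)]

def train_step(round_id, batch):
    """Placeholder training step returning a per-instruction loss."""
    rng = random.Random(round_id)
    return {inst: rng.random() for inst in batch}

def feedback_loop(seed_pool, rounds=3, per_round=8, keep_hard=4, seed=0):
    """Closed loop: generate a batch, train on it, then feed the
    highest-loss (hardest) instructions back as next-round seeds."""
    rng = random.Random(seed)
    for r in range(rounds):
        batch = generate_instructions(seed_pool, per_round, rng)
        losses = train_step(r, batch)
        hardest = sorted(batch, key=losses.get, reverse=True)[:keep_hard]
        seed_pool = seed_pool + hardest
    return seed_pool
```

The design point is the coupling itself: generation is no longer a one-shot preprocessing stage, so each round can steer subsequent instruction synthesis toward examples the model still handles poorly.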