SPA: A Simple but Tough-to-Beat Baseline for Knowledge Injection

📅 2026-03-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models often lack adequate knowledge coverage in specialized domains where data is scarce. To address this, the paper proposes SPA, a method that uses a small set of carefully designed prompts to generate large-scale, high-quality synthetic data for knowledge injection. Relying on prompt engineering and straightforward data augmentation rather than more complex pipelines such as RL-based generation or multi-stage prompting, SPA achieves significant gains over multiple strong baselines on knowledge injection tasks. The results show that a simple prompting strategy, combined with extensive synthetic data, can be remarkably effective, making SPA an efficient and hard-to-beat baseline in this area.

📝 Abstract
While large language models (LLMs) are pretrained on massive amounts of data, their knowledge coverage remains incomplete in specialized, data-scarce domains, motivating extensive efforts to study synthetic data generation for knowledge injection. We propose SPA (Scaling Prompt-engineered Augmentation), a simple but tough-to-beat baseline that uses a small set of carefully designed prompts to generate large-scale synthetic data for knowledge injection. Through systematic comparisons, we find that SPA outperforms several strong baselines. Furthermore, we identify two key limitations of prior approaches: (1) while RL-based methods may improve the token efficiency of LLM-based data augmentation at small scale, they suffer from diversity collapse as data scales, leading to diminishing returns; and (2) while multi-stage prompting may outperform simple augmentation methods, their advantages can disappear after careful prompt tuning. Our results suggest that, for knowledge injection, careful prompt design combined with straightforward large-scale augmentation can be surprisingly effective, and we hope SPA can serve as a strong baseline for future studies in this area. Our code is available at https://github.com/Tangkexian/SPA.
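The core recipe the abstract describes, crossing a small set of hand-designed prompts with source documents and sampling many completions to scale the synthetic corpus, can be sketched roughly as below. This is an illustrative reconstruction, not the paper's actual implementation: the template texts, function names, and the stubbed `llm_generate` call are all assumptions, and a real pipeline would call an LLM API at that point.

```python
import itertools

# Illustrative SPA-style augmentation sketch (names and templates are
# hypothetical, not taken from the paper or its codebase).
PROMPT_TEMPLATES = [
    "Rewrite the following passage as a question-answer pair:\n{doc}",
    "Summarize the key fact in this passage in one sentence:\n{doc}",
    "Explain the passage below to a domain novice:\n{doc}",
]

def llm_generate(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g. an API request)."""
    return f"[synthetic text for: {prompt[:40]}...]"

def spa_augment(documents, templates=PROMPT_TEMPLATES, n_samples_per_pair=2):
    """Cross every template with every source document and sample several
    completions per (template, document) pair to scale up the corpus."""
    corpus = []
    for template, doc in itertools.product(templates, documents):
        prompt = template.format(doc=doc)
        for _ in range(n_samples_per_pair):
            corpus.append({"prompt": prompt, "completion": llm_generate(prompt)})
    return corpus

docs = [
    "Aspirin irreversibly inhibits the COX-1 enzyme.",
    "Grace Hopper popularized the term 'debugging'.",
]
synthetic = spa_augment(docs)
print(len(synthetic))  # 3 templates x 2 docs x 2 samples = 12
```

The resulting `{"prompt", "completion"}` records would then be used as fine-tuning data for the target model; the abstract's scaling argument is that growing this simple cross-product corpus can outperform more elaborate RL-based or multi-stage generation schemes.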
Problem

Research questions and friction points this paper is trying to address.

knowledge injection
large language models
synthetic data generation
data-scarce domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

knowledge injection
synthetic data generation
prompt engineering
large language models
data augmentation